Table of Contents in this notebook
Welcome to our RentSafeTO project! In this capstone, we aim to predict the safety assessment of apartment buildings in Toronto using machine learning. Our goal is to provide valuable insights to property owners, empowering them to make informed decisions about housing maintenance.
The Big Idea:
RentSafeTO revolves around analyzing factors that influence building safety. By considering variables such as building height, construction year, population density, laundry facilities, and waste disposal, our predictive model aims to forecast the safety evaluation outcomes.
The Impact:
The impact of RentSafeTO extends to multiple stakeholders. Prospective property buyers can benefit from a risk assessment before purchasing an apartment building. Tenants planning to move can access safety information to choose safer living environments. Landlords can proactively address safety concerns, leading to a more secure rental market.
The Data:
Our project utilizes data from the Toronto open data site, encompassing various apartment building evaluations. By analyzing this dataset, we aim to uncover patterns and correlations, enabling our predictive model to make accurate safety assessments.
# Importing libraries
import pandas as pd
import numpy as np
import seaborn as sns
Library behaviour differs slightly from version to version and can emit warnings, so we suppress them here.
from warnings import filterwarnings
filterwarnings('ignore')
# Read Data
df = pd.read_csv('../data/Apartment Building Evaluation.csv')
It's time to check our data.
# Sanity Check
df.head()
| _id | RSN | YEAR_REGISTERED | YEAR_EVALUATED | YEAR_BUILT | PROPERTY_TYPE | WARD | WARDNAME | SITE_ADDRESS | CONFIRMED_STOREYS | ... | EXTERIOR_WALKWAYS | BALCONY_GUARDS | WATER_PEN_EXT_BLDG_ELEMENTS | PARKING_AREA | OTHER_FACILITIES | GRID | LATITUDE | LONGITUDE | X | Y | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4167486 | 4304347 | NaN | NaN | 1999.0 | PRIVATE | 2 | Etobicoke Centre | ** CREATED IN ERROR ** 399 THE WEST MALL | 22 | ... | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | W0233 | 43.643781 | -79.565456 | 299503.625 | 4833538.964 |
| 1 | 4167487 | 5157421 | 2023.0 | NaN | 1973.0 | TCHC | 17 | Don Valley North | 6 TREE SPARROWAY | 4 | ... | 3.0 | 5.0 | 4.0 | 3.0 | 4.0 | N1721 | 43.791384 | -79.369630 | 315272.148 | 4849932.515 |
| 2 | 4167488 | 5156814 | 2023.0 | NaN | 1973.0 | TCHC | 17 | Don Valley North | 13 FIELD SPARROWAY | 4 | ... | 4.0 | 5.0 | 4.0 | 3.0 | 4.0 | N1721 | 43.790920 | -79.368771 | 315334.815 | 4849906.373 |
| 3 | 4167489 | 5157387 | 2023.0 | NaN | 1973.0 | TCHC | 17 | Don Valley North | 4 TREE SPARROWAY | 4 | ... | 3.0 | 5.0 | 4.0 | 3.0 | 4.0 | N1721 | 43.791448 | -79.369332 | 315291.755 | 4849938.162 |
| 4 | 4167490 | 5156871 | 2023.0 | NaN | 1973.0 | TCHC | 17 | Don Valley North | 2 TREE SPARROWAY | 4 | ... | 5.0 | 5.0 | 4.0 | 3.0 | 4.0 | N1721 | 43.791511 | -79.369045 | 315330.308 | 4849947.465 |
5 rows × 40 columns
df.shape
(11760, 40)
In this dataset, we have 11760 rows and 40 columns.
for i in df.columns:
    print(i)
_id RSN YEAR_REGISTERED YEAR_EVALUATED YEAR_BUILT PROPERTY_TYPE WARD WARDNAME SITE_ADDRESS CONFIRMED_STOREYS CONFIRMED_UNITS EVALUATION_COMPLETED_ON SCORE RESULTS_OF_SCORE NO_OF_AREAS_EVALUATED ENTRANCE_LOBBY ENTRANCE_DOORS_WINDOWS SECURITY STAIRWELLS LAUNDRY_ROOMS INTERNAL_GUARDS_HANDRAILS GARBAGE_CHUTE_ROOMS GARBAGE_BIN_STORAGE_AREA ELEVATORS STORAGE_AREAS_LOCKERS INTERIOR_WALL_CEILING_FLOOR INTERIOR_LIGHTING_LEVELS GRAFFITI EXTERIOR_CLADDING EXTERIOR_GROUNDS EXTERIOR_WALKWAYS BALCONY_GUARDS WATER_PEN_EXT_BLDG_ELEMENTS PARKING_AREA OTHER_FACILITIES GRID LATITUDE LONGITUDE X Y
We have 40 columns; let's describe and group them.
Columns:
Building Information:
Location Information:
Building Details:
Exterior Maintenance:
Interior Maintenance:
Common Area Maintenance:
Building Hygiene:
Building Services:
Others:
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 11760 entries, 0 to 11759 Data columns (total 40 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 _id 11760 non-null int64 1 RSN 11760 non-null int64 2 YEAR_REGISTERED 11455 non-null float64 3 YEAR_EVALUATED 9751 non-null float64 4 YEAR_BUILT 11714 non-null float64 5 PROPERTY_TYPE 11760 non-null object 6 WARD 11760 non-null int64 7 WARDNAME 11760 non-null object 8 SITE_ADDRESS 11760 non-null object 9 CONFIRMED_STOREYS 11760 non-null int64 10 CONFIRMED_UNITS 11760 non-null int64 11 EVALUATION_COMPLETED_ON 11760 non-null object 12 SCORE 11760 non-null int64 13 RESULTS_OF_SCORE 11760 non-null object 14 NO_OF_AREAS_EVALUATED 11760 non-null int64 15 ENTRANCE_LOBBY 11758 non-null float64 16 ENTRANCE_DOORS_WINDOWS 11759 non-null float64 17 SECURITY 11754 non-null float64 18 STAIRWELLS 11757 non-null float64 19 LAUNDRY_ROOMS 11104 non-null float64 20 INTERNAL_GUARDS_HANDRAILS 11757 non-null float64 21 GARBAGE_CHUTE_ROOMS 5102 non-null float64 22 GARBAGE_BIN_STORAGE_AREA 11749 non-null float64 23 ELEVATORS 6897 non-null float64 24 STORAGE_AREAS_LOCKERS 4773 non-null float64 25 INTERIOR_WALL_CEILING_FLOOR 11758 non-null float64 26 INTERIOR_LIGHTING_LEVELS 11758 non-null float64 27 GRAFFITI 11721 non-null float64 28 EXTERIOR_CLADDING 11751 non-null float64 29 EXTERIOR_GROUNDS 11745 non-null float64 30 EXTERIOR_WALKWAYS 11754 non-null float64 31 BALCONY_GUARDS 7973 non-null float64 32 WATER_PEN_EXT_BLDG_ELEMENTS 11754 non-null float64 33 PARKING_AREA 10704 non-null float64 34 OTHER_FACILITIES 2254 non-null float64 35 GRID 11760 non-null object 36 LATITUDE 11533 non-null float64 37 LONGITUDE 11533 non-null float64 38 X 11671 non-null float64 39 Y 11671 non-null float64 dtypes: float64(27), int64(7), object(6) memory usage: 3.6+ MB
We can see that some columns contain NA values and that some dtypes should be adjusted. The float columns will be converted to int once the NA rows are cleaned.
df['EVALUATION_COMPLETED_ON'] = pd.to_datetime(df['EVALUATION_COMPLETED_ON']).dt.year
We keep only the year, since the exact evaluation date adds little information for our purposes.
df.duplicated().sum()
0
There are no duplicated rows; each record is unique.
for i in df.columns:
    print('number of distinct in', i, ':', df[i].nunique())
number of distinct in _id : 11760 number of distinct in RSN : 3513 number of distinct in YEAR_REGISTERED : 7 number of distinct in YEAR_EVALUATED : 5 number of distinct in YEAR_BUILT : 130 number of distinct in PROPERTY_TYPE : 3 number of distinct in WARD : 25 number of distinct in WARDNAME : 25 number of distinct in SITE_ADDRESS : 3513 number of distinct in CONFIRMED_STOREYS : 40 number of distinct in CONFIRMED_UNITS : 383 number of distinct in EVALUATION_COMPLETED_ON : 7 number of distinct in SCORE : 66 number of distinct in RESULTS_OF_SCORE : 4 number of distinct in NO_OF_AREAS_EVALUATED : 11 number of distinct in ENTRANCE_LOBBY : 5 number of distinct in ENTRANCE_DOORS_WINDOWS : 5 number of distinct in SECURITY : 5 number of distinct in STAIRWELLS : 5 number of distinct in LAUNDRY_ROOMS : 5 number of distinct in INTERNAL_GUARDS_HANDRAILS : 5 number of distinct in GARBAGE_CHUTE_ROOMS : 5 number of distinct in GARBAGE_BIN_STORAGE_AREA : 5 number of distinct in ELEVATORS : 5 number of distinct in STORAGE_AREAS_LOCKERS : 5 number of distinct in INTERIOR_WALL_CEILING_FLOOR : 5 number of distinct in INTERIOR_LIGHTING_LEVELS : 5 number of distinct in GRAFFITI : 5 number of distinct in EXTERIOR_CLADDING : 5 number of distinct in EXTERIOR_GROUNDS : 5 number of distinct in EXTERIOR_WALKWAYS : 5 number of distinct in BALCONY_GUARDS : 5 number of distinct in WATER_PEN_EXT_BLDG_ELEMENTS : 5 number of distinct in PARKING_AREA : 5 number of distinct in OTHER_FACILITIES : 5 number of distinct in GRID : 327 number of distinct in LATITUDE : 3414 number of distinct in LONGITUDE : 3414 number of distinct in X : 3473 number of distinct in Y : 3473
Most of the columns are scored from 1 to 5; these scored columns are the main inputs to the overall evaluation.
# There are four following results as a result of the evaluation score.
df['RESULTS_OF_SCORE'].value_counts()
RESULTS_OF_SCORE Evaluation needs to be conducted in 2 years 7396 Evaluation needs to be conducted in 1 year 2619 Evaluation needs to be conducted in 3 years 1628 Building Audit 117 Name: count, dtype: int64
'SCORE' is our target, giving the numeric evaluation result, while 'RESULTS_OF_SCORE' states the re-evaluation period assigned based on that score.
Checking the data revealed the need for cleaning. Let's clean it.
df.isna().sum()/df.shape[0]*100
_id 0.000000 RSN 0.000000 YEAR_REGISTERED 2.593537 YEAR_EVALUATED 17.083333 YEAR_BUILT 0.391156 PROPERTY_TYPE 0.000000 WARD 0.000000 WARDNAME 0.000000 SITE_ADDRESS 0.000000 CONFIRMED_STOREYS 0.000000 CONFIRMED_UNITS 0.000000 EVALUATION_COMPLETED_ON 0.000000 SCORE 0.000000 RESULTS_OF_SCORE 0.000000 NO_OF_AREAS_EVALUATED 0.000000 ENTRANCE_LOBBY 0.017007 ENTRANCE_DOORS_WINDOWS 0.008503 SECURITY 0.051020 STAIRWELLS 0.025510 LAUNDRY_ROOMS 5.578231 INTERNAL_GUARDS_HANDRAILS 0.025510 GARBAGE_CHUTE_ROOMS 56.615646 GARBAGE_BIN_STORAGE_AREA 0.093537 ELEVATORS 41.352041 STORAGE_AREAS_LOCKERS 59.413265 INTERIOR_WALL_CEILING_FLOOR 0.017007 INTERIOR_LIGHTING_LEVELS 0.017007 GRAFFITI 0.331633 EXTERIOR_CLADDING 0.076531 EXTERIOR_GROUNDS 0.127551 EXTERIOR_WALKWAYS 0.051020 BALCONY_GUARDS 32.202381 WATER_PEN_EXT_BLDG_ELEMENTS 0.051020 PARKING_AREA 8.979592 OTHER_FACILITIES 80.833333 GRID 0.000000 LATITUDE 1.930272 LONGITUDE 1.930272 X 0.756803 Y 0.756803 dtype: float64
As we can see, several columns ('GARBAGE_CHUTE_ROOMS', 'ELEVATORS', 'STORAGE_AREAS_LOCKERS', 'OTHER_FACILITIES') are missing in a large share of rows. We can drop these, along with redundant identifier and coordinate columns, since a cleaner dataset makes for more reliable analysis.
df_clean = df.drop(['RSN','WARD','SITE_ADDRESS','YEAR_EVALUATED','X','Y','GARBAGE_CHUTE_ROOMS', 'ELEVATORS', 'STORAGE_AREAS_LOCKERS', 'OTHER_FACILITIES'], axis=1)
# check again
df_clean.isna().sum()/df_clean.shape[0]*100
_id 0.000000 YEAR_REGISTERED 2.593537 YEAR_BUILT 0.391156 PROPERTY_TYPE 0.000000 WARDNAME 0.000000 CONFIRMED_STOREYS 0.000000 CONFIRMED_UNITS 0.000000 EVALUATION_COMPLETED_ON 0.000000 SCORE 0.000000 RESULTS_OF_SCORE 0.000000 NO_OF_AREAS_EVALUATED 0.000000 ENTRANCE_LOBBY 0.017007 ENTRANCE_DOORS_WINDOWS 0.008503 SECURITY 0.051020 STAIRWELLS 0.025510 LAUNDRY_ROOMS 5.578231 INTERNAL_GUARDS_HANDRAILS 0.025510 GARBAGE_BIN_STORAGE_AREA 0.093537 INTERIOR_WALL_CEILING_FLOOR 0.017007 INTERIOR_LIGHTING_LEVELS 0.017007 GRAFFITI 0.331633 EXTERIOR_CLADDING 0.076531 EXTERIOR_GROUNDS 0.127551 EXTERIOR_WALKWAYS 0.051020 BALCONY_GUARDS 32.202381 WATER_PEN_EXT_BLDG_ELEMENTS 0.051020 PARKING_AREA 8.979592 GRID 0.000000 LATITUDE 1.930272 LONGITUDE 1.930272 dtype: float64
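As a side note, the drop list above could also be derived programmatically from a missingness threshold instead of being hard-coded; a minimal sketch on a toy frame (the column names echo the dataset, but the values are made up):

```python
import numpy as np
import pandas as pd

# Toy stand-in for df: values are illustrative, not real evaluations
toy = pd.DataFrame({
    'SCORE': [72, 85, 60, 90],
    'ELEVATORS': [np.nan, 4, np.nan, np.nan],   # 75% missing
    'LAUNDRY_ROOMS': [3, np.nan, 4, 5],         # 25% missing
})

# Drop any column with more than 30% missing values
na_share = toy.isna().mean()
to_drop = na_share[na_share > 0.30].index.tolist()
toy_clean = toy.drop(columns=to_drop)
print(to_drop)   # ['ELEVATORS']
```

The 30% cutoff here is an assumption for illustration; the actual choice depends on how much information one is willing to lose.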
The remaining NAs will be handled differently. First, let's look at the items scored from 1 to 5.
# Columns scored on a 1-to-5 scale
df_scored = df_clean[['ENTRANCE_LOBBY',
'ENTRANCE_DOORS_WINDOWS',
'SECURITY',
'STAIRWELLS',
'LAUNDRY_ROOMS',
'INTERNAL_GUARDS_HANDRAILS',
'GARBAGE_BIN_STORAGE_AREA',
'INTERIOR_WALL_CEILING_FLOOR',
'INTERIOR_LIGHTING_LEVELS',
'GRAFFITI',
'EXTERIOR_CLADDING',
'EXTERIOR_GROUNDS',
'EXTERIOR_WALKWAYS',
'BALCONY_GUARDS',
'WATER_PEN_EXT_BLDG_ELEMENTS',
'PARKING_AREA']]
df_scored.describe()
| ENTRANCE_LOBBY | ENTRANCE_DOORS_WINDOWS | SECURITY | STAIRWELLS | LAUNDRY_ROOMS | INTERNAL_GUARDS_HANDRAILS | GARBAGE_BIN_STORAGE_AREA | INTERIOR_WALL_CEILING_FLOOR | INTERIOR_LIGHTING_LEVELS | GRAFFITI | EXTERIOR_CLADDING | EXTERIOR_GROUNDS | EXTERIOR_WALKWAYS | BALCONY_GUARDS | WATER_PEN_EXT_BLDG_ELEMENTS | PARKING_AREA | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 11758.000000 | 11759.000000 | 11754.000000 | 11757.000000 | 11104.000000 | 11757.000000 | 11749.000000 | 11758.000000 | 11758.000000 | 11721.000000 | 11751.000000 | 11745.000000 | 11754.000000 | 7973.000000 | 11754.000000 | 10704.000000 |
| mean | 3.713642 | 3.675313 | 4.126425 | 3.453857 | 3.575919 | 3.603640 | 3.607201 | 3.492686 | 3.672393 | 4.610869 | 3.549060 | 3.650575 | 3.643866 | 3.752665 | 3.668453 | 3.392096 |
| std | 0.775948 | 0.770057 | 0.877997 | 0.787374 | 0.794015 | 0.830116 | 0.782764 | 0.767906 | 0.878231 | 0.755874 | 0.718478 | 0.754074 | 0.744887 | 0.833194 | 0.739714 | 0.757125 |
| min | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| 25% | 3.000000 | 3.000000 | 3.000000 | 3.000000 | 3.000000 | 3.000000 | 3.000000 | 3.000000 | 3.000000 | 4.000000 | 3.000000 | 3.000000 | 3.000000 | 3.000000 | 3.000000 | 3.000000 |
| 50% | 4.000000 | 4.000000 | 4.000000 | 3.000000 | 4.000000 | 4.000000 | 4.000000 | 3.000000 | 4.000000 | 5.000000 | 4.000000 | 4.000000 | 4.000000 | 4.000000 | 4.000000 | 3.000000 |
| 75% | 4.000000 | 4.000000 | 5.000000 | 4.000000 | 4.000000 | 4.000000 | 4.000000 | 4.000000 | 4.000000 | 5.000000 | 4.000000 | 4.000000 | 4.000000 | 4.000000 | 4.000000 | 4.000000 |
| max | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 |
Most scores are 3 or above, but the scores below 3 deserve attention: consistently low category scores are what push a building toward an audit.
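To act on that, buildings with any category scored below 3 can be flagged in one vectorized step; a small sketch with made-up rows (assuming a frame shaped like df_scored):

```python
import pandas as pd

# Toy stand-in for df_scored: 1-to-5 category scores, values made up
toy_scored = pd.DataFrame({
    'SECURITY':   [5, 2, 4],
    'STAIRWELLS': [4, 3, 1],
})

# True where a building has at least one category below 3
low_flag = (toy_scored < 3).any(axis=1)
print(low_flag.sum())   # 2 buildings flagged
```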
32.20% of 'BALCONY_GUARDS', 5.58% of 'LAUNDRY_ROOMS', and 8.98% of 'PARKING_AREA' values are missing. We will replace those with the mode.
Additionally, all other rows with missing values will be dropped.
most_freq1 = df_clean['BALCONY_GUARDS'].value_counts().idxmax()
most_freq2 = df_clean['LAUNDRY_ROOMS'].value_counts().idxmax()
most_freq3 = df_clean['PARKING_AREA'].value_counts().idxmax()
df_clean['BALCONY_GUARDS'] = df_clean['BALCONY_GUARDS'].fillna(most_freq1)
df_clean['LAUNDRY_ROOMS'] = df_clean['LAUNDRY_ROOMS'].fillna(most_freq2)
df_clean['PARKING_AREA'] = df_clean['PARKING_AREA'].fillna(most_freq3)
df_clean.dropna(inplace=True)
# Clean data sanity check
df_clean.isna().sum()/df_clean.shape[0]*100
_id 0.0 YEAR_REGISTERED 0.0 YEAR_BUILT 0.0 PROPERTY_TYPE 0.0 WARDNAME 0.0 CONFIRMED_STOREYS 0.0 CONFIRMED_UNITS 0.0 EVALUATION_COMPLETED_ON 0.0 SCORE 0.0 RESULTS_OF_SCORE 0.0 NO_OF_AREAS_EVALUATED 0.0 ENTRANCE_LOBBY 0.0 ENTRANCE_DOORS_WINDOWS 0.0 SECURITY 0.0 STAIRWELLS 0.0 LAUNDRY_ROOMS 0.0 INTERNAL_GUARDS_HANDRAILS 0.0 GARBAGE_BIN_STORAGE_AREA 0.0 INTERIOR_WALL_CEILING_FLOOR 0.0 INTERIOR_LIGHTING_LEVELS 0.0 GRAFFITI 0.0 EXTERIOR_CLADDING 0.0 EXTERIOR_GROUNDS 0.0 EXTERIOR_WALKWAYS 0.0 BALCONY_GUARDS 0.0 WATER_PEN_EXT_BLDG_ELEMENTS 0.0 PARKING_AREA 0.0 GRID 0.0 LATITUDE 0.0 LONGITUDE 0.0 dtype: float64
float_columns = [
'YEAR_REGISTERED',
'YEAR_BUILT',
'ENTRANCE_LOBBY',
'ENTRANCE_DOORS_WINDOWS',
'SECURITY',
'STAIRWELLS',
'LAUNDRY_ROOMS',
'INTERNAL_GUARDS_HANDRAILS',
'GARBAGE_BIN_STORAGE_AREA',
'INTERIOR_WALL_CEILING_FLOOR',
'INTERIOR_LIGHTING_LEVELS',
'GRAFFITI',
'EXTERIOR_CLADDING',
'EXTERIOR_GROUNDS',
'EXTERIOR_WALKWAYS',
'BALCONY_GUARDS',
'WATER_PEN_EXT_BLDG_ELEMENTS',
'PARKING_AREA']
df_clean[float_columns] = df_clean[float_columns].astype(int)
Now that there is no NA, we will fix the data type.
df_clean.info()
<class 'pandas.core.frame.DataFrame'> Index: 11152 entries, 1 to 11759 Data columns (total 30 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 _id 11152 non-null int64 1 YEAR_REGISTERED 11152 non-null int64 2 YEAR_BUILT 11152 non-null int64 3 PROPERTY_TYPE 11152 non-null object 4 WARDNAME 11152 non-null object 5 CONFIRMED_STOREYS 11152 non-null int64 6 CONFIRMED_UNITS 11152 non-null int64 7 EVALUATION_COMPLETED_ON 11152 non-null int32 8 SCORE 11152 non-null int64 9 RESULTS_OF_SCORE 11152 non-null object 10 NO_OF_AREAS_EVALUATED 11152 non-null int64 11 ENTRANCE_LOBBY 11152 non-null int64 12 ENTRANCE_DOORS_WINDOWS 11152 non-null int64 13 SECURITY 11152 non-null int64 14 STAIRWELLS 11152 non-null int64 15 LAUNDRY_ROOMS 11152 non-null int64 16 INTERNAL_GUARDS_HANDRAILS 11152 non-null int64 17 GARBAGE_BIN_STORAGE_AREA 11152 non-null int64 18 INTERIOR_WALL_CEILING_FLOOR 11152 non-null int64 19 INTERIOR_LIGHTING_LEVELS 11152 non-null int64 20 GRAFFITI 11152 non-null int64 21 EXTERIOR_CLADDING 11152 non-null int64 22 EXTERIOR_GROUNDS 11152 non-null int64 23 EXTERIOR_WALKWAYS 11152 non-null int64 24 BALCONY_GUARDS 11152 non-null int64 25 WATER_PEN_EXT_BLDG_ELEMENTS 11152 non-null int64 26 PARKING_AREA 11152 non-null int64 27 GRID 11152 non-null object 28 LATITUDE 11152 non-null float64 29 LONGITUDE 11152 non-null float64 dtypes: float64(2), int32(1), int64(23), object(4) memory usage: 2.6+ MB
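As an aside, pandas also offers the nullable `Int64` dtype, which can hold integers and missing values together; with it, the cast would not have to wait until after the NA handling. A minimal sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([3.0, np.nan, 5.0])
s_int = s.astype('Int64')   # nullable integer dtype keeps the NA

print(s_int.dtype)          # Int64
print(s_int.isna().sum())   # 1
```

Plain `astype(int)` would raise on a column that still contains NaN, which is why the notebook drops NA rows first; `Int64` is simply the alternative route.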
I'm finally done cleaning!
Data Cleaning (Summary):
We will use visualization to explore the data efficiently.
import matplotlib.pyplot as plt
Columns were grouped before running EDA, since analyzing all 40 columns at once is inefficient and makes patterns harder to see.
Of course, hygiene and additional facilities can matter indirectly, but we want to focus on the more direct safety factors.
Relationship between basic building information (not the evaluated category scores) and SCORE
Numerical data - continuous
Numerical data - discrete
Categorical - nominal
target Variable
from scipy import stats
import statsmodels.api as sm
Y = df_clean['SCORE'].values
for i in ['YEAR_REGISTERED', 'YEAR_BUILT', 'CONFIRMED_STOREYS', 'CONFIRMED_UNITS', 'NO_OF_AREAS_EVALUATED']:
    print('*', i)
    X = df_clean[i].values
    r, p = stats.pearsonr(X, Y)
    print('Correlation:', r)
    print('P-value:', p)
    print('\n')
* YEAR_REGISTERED Correlation: -0.04694172099864394 P-value: 7.072356943071497e-07 * YEAR_BUILT Correlation: 0.16873709868865364 P-value: 5.160153615173216e-72 * CONFIRMED_STOREYS Correlation: 0.12568179491929196 P-value: 1.6842857551320354e-40 * CONFIRMED_UNITS Correlation: 0.09936299669697163 P-value: 7.145319371979502e-26 * NO_OF_AREAS_EVALUATED Correlation: 0.23153736362899446 P-value: 1.274036340575308e-135
The p-value for each of these building-information columns is very small, which means these variables are worth keeping as predictors.
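The same pairwise correlations can also be computed in one vectorized call with `DataFrame.corrwith` (toy numbers for illustration, not the real data):

```python
import pandas as pd

# Toy stand-in for the building-information columns and SCORE
toy = pd.DataFrame({
    'YEAR_BUILT':        [1950, 1970, 1990, 2010],
    'CONFIRMED_STOREYS': [3, 10, 15, 25],
    'SCORE':             [60, 70, 80, 95],
})

# Pearson correlation of each predictor with SCORE, in one call
corrs = toy[['YEAR_BUILT', 'CONFIRMED_STOREYS']].corrwith(toy['SCORE'])
print(corrs.round(3))
```

`corrwith` gives the r values only; `scipy.stats.pearsonr` is still needed when the p-values matter, as above.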
plt.figure()
df_clean.boxplot(column=['SCORE'], by=['YEAR_REGISTERED'])
plt.show()
<Figure size 640x480 with 0 Axes>
Looking at the boxplot of scores by registration year, the medians are similar, but earlier registration years show more low-score outliers. The 2023 data looks different because the year is not yet complete.
plt.figure()
Y = df_clean['SCORE'].values
X = df_clean['YEAR_BUILT'].values
plt.scatter(X, Y, alpha=0.3)
plt.title('YEAR_BUILT')
plt.show()
Although no strong pattern stands out, buildings built around the 2000s show far fewer scores below 60 than those built around the 1950s.
Y = df_clean['SCORE'].values
X = df_clean['CONFIRMED_STOREYS'].values
plt.scatter(X, Y, alpha=0.5)
plt.title('CONFIRMED_STOREYS')
plt.show()
Y = df_clean['SCORE'].values
X = df_clean['CONFIRMED_UNITS'].values
plt.scatter(X, Y, alpha=0.5)
plt.title('CONFIRMED_UNITS')
plt.show()
The larger the number of units and storeys, the narrower the spread of scores; smaller buildings also account for most of the lowest scores.
df_clean.boxplot(column=['SCORE'], by=['NO_OF_AREAS_EVALUATED'])
<Axes: title={'center': 'SCORE'}, xlabel='[NO_OF_AREAS_EVALUATED]'>
Let's look at the category data.
plt.figure()
plt.title('PROPERTY_TYPE')
property_counts = df_clean['PROPERTY_TYPE'].value_counts()
ax = df_clean['PROPERTY_TYPE'].value_counts().plot.bar()
for i, count in enumerate(property_counts):
ax.annotate(str(count), xy=(i, count), ha='center', va='bottom')
plt.show()
Most buildings are privately owned.
plt.figure(figsize = (12,6))
plt.title('WARDNAME')
wardname_counts = df_clean['WARDNAME'].value_counts()
ax = df_clean['WARDNAME'].value_counts().plot.bar()
for i, count in enumerate(wardname_counts):
ax.annotate(str(count), xy=(i, count), ha='center', va='bottom')
#plt.savefig('WARDNAME.png')
plt.show()
"Toronto-St. Paul's" has the most buildings, followed by "Eglinton-Lawrence" and "Etobicoke-Lakeshore".
Another package is needed for map visualization.
!pip install folium
Requirement already satisfied: folium in /Users/jaysworld/anaconda3/lib/python3.10/site-packages (0.14.0) Requirement already satisfied: numpy in /Users/jaysworld/anaconda3/lib/python3.10/site-packages (from folium) (1.23.5) Requirement already satisfied: jinja2>=2.9 in /Users/jaysworld/anaconda3/lib/python3.10/site-packages (from folium) (3.1.2) Requirement already satisfied: requests in /Users/jaysworld/anaconda3/lib/python3.10/site-packages (from folium) (2.28.1) Requirement already satisfied: branca>=0.6.0 in /Users/jaysworld/anaconda3/lib/python3.10/site-packages (from folium) (0.6.0) Requirement already satisfied: MarkupSafe>=2.0 in /Users/jaysworld/anaconda3/lib/python3.10/site-packages (from jinja2>=2.9->folium) (2.1.1) Requirement already satisfied: idna<4,>=2.5 in /Users/jaysworld/anaconda3/lib/python3.10/site-packages (from requests->folium) (3.4) Requirement already satisfied: urllib3<1.27,>=1.21.1 in /Users/jaysworld/anaconda3/lib/python3.10/site-packages (from requests->folium) (1.26.14) Requirement already satisfied: charset-normalizer<3,>=2 in /Users/jaysworld/anaconda3/lib/python3.10/site-packages (from requests->folium) (2.0.4) Requirement already satisfied: certifi>=2017.4.17 in /Users/jaysworld/anaconda3/lib/python3.10/site-packages (from requests->folium) (2023.7.22)
import folium
# Latitude of Toronto
latitude = 43.651070
# Longitude of Toronto
longitude = -79.347015
# the points of apartments
location_score = df_clean[['LATITUDE', 'LONGITUDE','SCORE']]
from folium.plugins import MarkerCluster
m = folium.Map(location=[latitude, longitude],
zoom_start=13,
width=750,
height=500
)
# CN Tower location
folium.Marker([43.642567, -79.387054],
popup='CN Tower',
tooltip='Landmark of Toronto').add_to(m)
location = df_clean[['LATITUDE', 'LONGITUDE']]
marker_cluster = MarkerCluster().add_to(m)
for lat, long in zip(location['LATITUDE'], location['LONGITUDE']):
folium.Marker([lat, long], icon = folium.Icon(color="green")).add_to(marker_cluster)
# m.save('zoom.html')
m
from folium.plugins import HeatMap
m = folium.Map(location=[latitude, longitude],
zoom_start=13,
width=750,
height=500
)
# CN Tower location
folium.Marker([43.642567, -79.387054],
popup='CN Tower',
tooltip='Landmark of Toronto').add_to(m)
location = df_clean[['LATITUDE', 'LONGITUDE']]
data = location.values.tolist()
heatmap = HeatMap(data,
min_opacity=0.05,
max_opacity=0.9,
radius=25)
heatmap.add_to(m)
# m.save('heatmap.html')
m
This time we will look at the EDA for the items evaluated.
subset_columns = ['ENTRANCE_LOBBY', 'ENTRANCE_DOORS_WINDOWS', 'SECURITY', 'STAIRWELLS', 'LAUNDRY_ROOMS',
'INTERNAL_GUARDS_HANDRAILS', 'GARBAGE_BIN_STORAGE_AREA', 'INTERIOR_WALL_CEILING_FLOOR',
'INTERIOR_LIGHTING_LEVELS', 'GRAFFITI', 'EXTERIOR_CLADDING', 'EXTERIOR_GROUNDS',
'EXTERIOR_WALKWAYS', 'BALCONY_GUARDS', 'WATER_PEN_EXT_BLDG_ELEMENTS', 'PARKING_AREA']
subset_df = df_clean[subset_columns]
sns.set(style="whitegrid")
f, axes = plt.subplots(4, 4, figsize=(15, 12))
axes = axes.ravel()
# Plot histograms for each column
for i, col in enumerate(subset_columns):
sns.histplot(data=subset_df, x=col, ax=axes[i], kde=True)
axes[i].set_title(col)
axes[i].set_xlabel(col)
axes[i].set_ylabel('Frequency')
plt.tight_layout()
plt.show()
From here on, each model's variables will be suffixed with a number to keep the experiments separate.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
For this first model, variables are suffixed with 1.
X1 = df_clean[['YEAR_REGISTERED', 'YEAR_BUILT', 'CONFIRMED_STOREYS', 'CONFIRMED_UNITS', 'NO_OF_AREAS_EVALUATED']]
y1 = df_clean['SCORE']
X_train1, X_test1, y_train1, y_test1 = train_test_split(X1, y1, test_size=0.2, random_state=42)
#model
model1 = LinearRegression()
#fit
model1.fit(X_train1, y_train1)
#predict
y_pred1 = model1.predict(X_test1)
y_pred1
array([71.09875189, 76.2080582 , 75.62945901, ..., 69.52470475,
73.78822268, 74.07809092])
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
# Calculate the Mean Absolute Error
mae_pred1 = mean_absolute_error(y_test1, y_pred1)
# Calculate the Mean Squared Error
mse1 = mean_squared_error(y_test1, y_pred1)
# Calculate the Root Mean Square Error
rmse1 = mean_squared_error(y_test1, y_pred1, squared=False)
print('MAE:',mae_pred1)
print('MSE:',mse1)
print('RMSE:',rmse1)
MAE: 8.158775554020059 MSE: 100.2624994530014 RMSE: 10.013116370691064
The MAE, the average absolute prediction error, is about 8, which seems small on a SCORE scale topping out at 100. However, since most scores cluster in a narrow band, it is hard to call this accurate. Let's build more models.
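One way to judge whether an MAE of ~8 is meaningful is to compare it with a naive baseline that always predicts the mean score; a numpy-only sketch on synthetic scores (the distribution here is an assumption, not taken from the data):

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.normal(75, 10, size=1000)        # synthetic SCORE-like values
y_base = np.full_like(y_true, y_true.mean())  # baseline: always the mean

baseline_mae = np.abs(y_true - y_base).mean()
print(round(baseline_mae, 2))
# A model is only useful if its MAE clearly beats this baseline.
```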
plt.figure()
plt.scatter(y_test1, y_pred1, alpha=0.4)
plt.xlabel('Actual Score')
plt.ylabel('Predicted Score')
plt.title('Multiple Linear Regression')
plt.show()
The chart compares predicted against actual scores for the linear regression on the numerical features. Unfortunately, the points do not line up along the diagonal, so the predictions are not reliable.
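Beyond eyeballing the scatter, the coefficient of determination R² quantifies how much variance the model explains; a self-contained toy computation (the numbers are made up):

```python
import numpy as np

y_true = np.array([60.0, 70.0, 80.0, 90.0])
y_pred = np.array([65.0, 68.0, 79.0, 88.0])

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(round(r2, 3))   # 0.932
```

scikit-learn's `r2_score(y_test1, y_pred1)` computes the same quantity directly.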
import statsmodels.api as sm
df_clean still contains identification and location columns. These are not needed for a linear regression model, so we will use only the relevant predictors.
For this model, variables are suffixed with 2.
# Non-redundant predictor columns whose values feed directly into the evaluation
eva_columns2 = df_clean[['YEAR_REGISTERED',
'YEAR_BUILT',
'CONFIRMED_STOREYS',
'CONFIRMED_UNITS',
'NO_OF_AREAS_EVALUATED',
'ENTRANCE_LOBBY',
'ENTRANCE_DOORS_WINDOWS',
'SECURITY',
'STAIRWELLS',
'LAUNDRY_ROOMS',
'INTERNAL_GUARDS_HANDRAILS',
'GARBAGE_BIN_STORAGE_AREA',
'INTERIOR_WALL_CEILING_FLOOR',
'INTERIOR_LIGHTING_LEVELS',
'GRAFFITI',
'EXTERIOR_CLADDING',
'EXTERIOR_GROUNDS',
'EXTERIOR_WALKWAYS',
'BALCONY_GUARDS',
'WATER_PEN_EXT_BLDG_ELEMENTS',
'PARKING_AREA']]
X2 = eva_columns2
y2 = df_clean['SCORE']
X2_withconstant2 = sm.add_constant(X2)
# 1. Instantiate Model
myregression2 = sm.OLS(y2, X2_withconstant2)
# 2. Fit Model (this returns a separate object with the parameters)
myregression_results2 = myregression2.fit()
# Looking at the summary
myregression_results2.summary()
| Dep. Variable: | SCORE | R-squared: | 0.989 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.989 |
| Method: | Least Squares | F-statistic: | 4.938e+04 |
| Date: | Wed, 06 Sep 2023 | Prob (F-statistic): | 0.00 |
| Time: | 21:24:40 | Log-Likelihood: | -16586. |
| No. Observations: | 11152 | AIC: | 3.322e+04 |
| Df Residuals: | 11130 | BIC: | 3.338e+04 |
| Df Model: | 21 | ||
| Covariance Type: | nonrobust |
| coef | std err | t | P>|t| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| const | -59.6513 | 35.145 | -1.697 | 0.090 | -128.542 | 9.240 |
| YEAR_REGISTERED | 0.0274 | 0.017 | 1.573 | 0.116 | -0.007 | 0.061 |
| YEAR_BUILT | 0.0028 | 0.001 | 4.198 | 0.000 | 0.001 | 0.004 |
| CONFIRMED_STOREYS | 0.0102 | 0.003 | 2.986 | 0.003 | 0.004 | 0.017 |
| CONFIRMED_UNITS | -1.821e-05 | 0.000 | -0.082 | 0.934 | -0.000 | 0.000 |
| NO_OF_AREAS_EVALUATED | -0.1220 | 0.009 | -13.913 | 0.000 | -0.139 | -0.105 |
| ENTRANCE_LOBBY | 1.4649 | 0.020 | 72.374 | 0.000 | 1.425 | 1.505 |
| ENTRANCE_DOORS_WINDOWS | 1.2556 | 0.019 | 67.197 | 0.000 | 1.219 | 1.292 |
| SECURITY | 1.2852 | 0.015 | 85.412 | 0.000 | 1.256 | 1.315 |
| STAIRWELLS | 1.3693 | 0.019 | 73.234 | 0.000 | 1.333 | 1.406 |
| LAUNDRY_ROOMS | 1.3357 | 0.017 | 76.329 | 0.000 | 1.301 | 1.370 |
| INTERNAL_GUARDS_HANDRAILS | 1.3049 | 0.015 | 86.773 | 0.000 | 1.275 | 1.334 |
| GARBAGE_BIN_STORAGE_AREA | 1.3553 | 0.016 | 83.072 | 0.000 | 1.323 | 1.387 |
| INTERIOR_WALL_CEILING_FLOOR | 1.3242 | 0.019 | 71.316 | 0.000 | 1.288 | 1.361 |
| INTERIOR_LIGHTING_LEVELS | 1.3151 | 0.016 | 84.056 | 0.000 | 1.284 | 1.346 |
| GRAFFITI | 1.1988 | 0.015 | 80.734 | 0.000 | 1.170 | 1.228 |
| EXTERIOR_CLADDING | 1.2485 | 0.020 | 63.164 | 0.000 | 1.210 | 1.287 |
| EXTERIOR_GROUNDS | 1.3542 | 0.019 | 70.484 | 0.000 | 1.317 | 1.392 |
| EXTERIOR_WALKWAYS | 1.2181 | 0.019 | 65.329 | 0.000 | 1.182 | 1.255 |
| BALCONY_GUARDS | 0.9270 | 0.017 | 54.618 | 0.000 | 0.894 | 0.960 |
| WATER_PEN_EXT_BLDG_ELEMENTS | 1.2169 | 0.018 | 66.434 | 0.000 | 1.181 | 1.253 |
| PARKING_AREA | 1.0911 | 0.016 | 66.976 | 0.000 | 1.059 | 1.123 |
| Omnibus: | 253.878 | Durbin-Watson: | 1.864 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 579.714 |
| Skew: | -0.032 | Prob(JB): | 1.31e-126 |
| Kurtosis: | 4.115 | Cond. No. | 9.75e+06 |
As you can see, the p-values of 'CONFIRMED_UNITS' and 'YEAR_REGISTERED' are above 0.05, so we will drop them.
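The drop could also be selected programmatically from the fitted model's p-values (statsmodels exposes them as `myregression_results2.pvalues`); a sketch on a stand-in Series whose values echo the summary table above:

```python
import pandas as pd

# Stand-in for myregression_results2.pvalues (values echo the summary above)
pvalues = pd.Series({
    'const': 0.090,
    'YEAR_REGISTERED': 0.116,
    'YEAR_BUILT': 0.000,
    'CONFIRMED_UNITS': 0.934,
})

# Predictors (the constant aside) that fail the 5% significance level
pv = pvalues.drop('const')
insignificant = pv[pv > 0.05].index.tolist()
print(sorted(insignificant))   # ['CONFIRMED_UNITS', 'YEAR_REGISTERED']
```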
eva_columns3 = df_clean[[
'YEAR_BUILT',
'CONFIRMED_STOREYS',
'NO_OF_AREAS_EVALUATED',
'ENTRANCE_LOBBY',
'ENTRANCE_DOORS_WINDOWS',
'SECURITY',
'STAIRWELLS',
'LAUNDRY_ROOMS',
'INTERNAL_GUARDS_HANDRAILS',
'GARBAGE_BIN_STORAGE_AREA',
'INTERIOR_WALL_CEILING_FLOOR',
'INTERIOR_LIGHTING_LEVELS',
'GRAFFITI',
'EXTERIOR_CLADDING',
'EXTERIOR_GROUNDS',
'EXTERIOR_WALKWAYS',
'BALCONY_GUARDS',
'WATER_PEN_EXT_BLDG_ELEMENTS',
'PARKING_AREA']]
For this model, variables are suffixed with 3.
X3 = eva_columns3
y3 = df_clean['SCORE']
X3_withconstant3 = sm.add_constant(X3)
# 1. Instantiate Model
myregression3 = sm.OLS(y3, X3_withconstant3)
# 2. Fit Model (this returns a separate object with the parameters)
myregression_results3 = myregression3.fit()
# Looking at the summary
myregression_results3.summary()
| Dep. Variable: | SCORE | R-squared: | 0.989 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.989 |
| Method: | Least Squares | F-statistic: | 5.457e+04 |
| Date: | Wed, 06 Sep 2023 | Prob (F-statistic): | 0.00 |
| Time: | 21:24:40 | Log-Likelihood: | -16588. |
| No. Observations: | 11152 | AIC: | 3.322e+04 |
| Df Residuals: | 11132 | BIC: | 3.336e+04 |
| Df Model: | 19 | ||
| Covariance Type: | nonrobust |
| coef | std err | t | P>|t| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| const | -4.4001 | 1.236 | -3.560 | 0.000 | -6.823 | -1.977 |
| YEAR_BUILT | 0.0027 | 0.001 | 4.177 | 0.000 | 0.001 | 0.004 |
| CONFIRMED_STOREYS | 0.0100 | 0.002 | 4.691 | 0.000 | 0.006 | 0.014 |
| NO_OF_AREAS_EVALUATED | -0.1235 | 0.009 | -14.311 | 0.000 | -0.140 | -0.107 |
| ENTRANCE_LOBBY | 1.4657 | 0.020 | 72.435 | 0.000 | 1.426 | 1.505 |
| ENTRANCE_DOORS_WINDOWS | 1.2556 | 0.019 | 67.198 | 0.000 | 1.219 | 1.292 |
| SECURITY | 1.2847 | 0.015 | 85.402 | 0.000 | 1.255 | 1.314 |
| STAIRWELLS | 1.3694 | 0.019 | 73.241 | 0.000 | 1.333 | 1.406 |
| LAUNDRY_ROOMS | 1.3344 | 0.017 | 76.349 | 0.000 | 1.300 | 1.369 |
| INTERNAL_GUARDS_HANDRAILS | 1.3060 | 0.015 | 86.960 | 0.000 | 1.277 | 1.335 |
| GARBAGE_BIN_STORAGE_AREA | 1.3557 | 0.016 | 83.119 | 0.000 | 1.324 | 1.388 |
| INTERIOR_WALL_CEILING_FLOOR | 1.3236 | 0.019 | 71.359 | 0.000 | 1.287 | 1.360 |
| INTERIOR_LIGHTING_LEVELS | 1.3153 | 0.016 | 84.117 | 0.000 | 1.285 | 1.346 |
| GRAFFITI | 1.1984 | 0.015 | 80.981 | 0.000 | 1.169 | 1.227 |
| EXTERIOR_CLADDING | 1.2486 | 0.020 | 63.166 | 0.000 | 1.210 | 1.287 |
| EXTERIOR_GROUNDS | 1.3542 | 0.019 | 70.511 | 0.000 | 1.317 | 1.392 |
| EXTERIOR_WALKWAYS | 1.2177 | 0.019 | 65.312 | 0.000 | 1.181 | 1.254 |
| BALCONY_GUARDS | 0.9273 | 0.017 | 54.640 | 0.000 | 0.894 | 0.961 |
| WATER_PEN_EXT_BLDG_ELEMENTS | 1.2169 | 0.018 | 66.432 | 0.000 | 1.181 | 1.253 |
| PARKING_AREA | 1.0909 | 0.016 | 66.971 | 0.000 | 1.059 | 1.123 |
| Omnibus: | 254.512 | Durbin-Watson: | 1.863 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 582.448 |
| Skew: | -0.030 | Prob(JB): | 3.34e-127 |
| Kurtosis: | 4.118 | Cond. No. | 2.39e+05 |
We considered various combinations of variables. Next, let's check how correlated the predictors are before continuing.
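Highly correlated predictor pairs can also be extracted from a correlation matrix automatically rather than scanned by eye; a toy sketch (matrix values are made up, and the 0.8 cutoff is an assumption):

```python
import pandas as pd

# Toy correlation matrix (values made up)
corr = pd.DataFrame(
    [[1.0, 0.9, 0.2],
     [0.9, 1.0, 0.1],
     [0.2, 0.1, 1.0]],
    index=['A', 'B', 'C'], columns=['A', 'B', 'C'],
)

# Walk the upper triangle so each pair is reported once
pairs = [
    (r, c, corr.loc[r, c])
    for i, r in enumerate(corr.index)
    for c in corr.columns[i + 1:]
    if abs(corr.loc[r, c]) > 0.8
]
print(pairs)   # [('A', 'B', 0.9)]
```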
X3.corr()
| YEAR_BUILT | CONFIRMED_STOREYS | NO_OF_AREAS_EVALUATED | ENTRANCE_LOBBY | ENTRANCE_DOORS_WINDOWS | SECURITY | STAIRWELLS | LAUNDRY_ROOMS | INTERNAL_GUARDS_HANDRAILS | GARBAGE_BIN_STORAGE_AREA | INTERIOR_WALL_CEILING_FLOOR | INTERIOR_LIGHTING_LEVELS | GRAFFITI | EXTERIOR_CLADDING | EXTERIOR_GROUNDS | EXTERIOR_WALKWAYS | BALCONY_GUARDS | WATER_PEN_EXT_BLDG_ELEMENTS | PARKING_AREA | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| YEAR_BUILT | 1.000000 | 0.366900 | 0.522412 | 0.171732 | 0.120779 | 0.090139 | 0.076234 | 0.150829 | 0.184785 | 0.063703 | 0.067745 | 0.135345 | -0.026563 | 0.211636 | 0.108594 | 0.140493 | 0.028208 | 0.155352 | 0.115760 |
| CONFIRMED_STOREYS | 0.366900 | 1.000000 | 0.593097 | 0.219177 | 0.131271 | 0.120292 | 0.006160 | 0.157234 | 0.153452 | 0.044288 | 0.030435 | 0.142126 | -0.119631 | 0.117091 | 0.088737 | 0.098031 | 0.027771 | 0.075040 | 0.007841 |
| NO_OF_AREAS_EVALUATED | 0.522412 | 0.593097 | 1.000000 | 0.304457 | 0.186586 | 0.179652 | 0.136259 | 0.275323 | 0.233998 | 0.116819 | 0.110421 | 0.215188 | -0.011284 | 0.173537 | 0.157517 | 0.161837 | 0.001725 | 0.130761 | 0.116301 |
| ENTRANCE_LOBBY | 0.171732 | 0.219177 | 0.304457 | 1.000000 | 0.586762 | 0.495384 | 0.562646 | 0.565121 | 0.429285 | 0.457104 | 0.545580 | 0.517893 | 0.281387 | 0.442738 | 0.527497 | 0.479569 | 0.343083 | 0.389781 | 0.323259 |
| ENTRANCE_DOORS_WINDOWS | 0.120779 | 0.131271 | 0.186586 | 0.586762 | 1.000000 | 0.510583 | 0.447198 | 0.458404 | 0.387223 | 0.427268 | 0.488562 | 0.497602 | 0.276959 | 0.459706 | 0.521023 | 0.492676 | 0.344897 | 0.423843 | 0.351325 |
| SECURITY | 0.090139 | 0.120292 | 0.179652 | 0.495384 | 0.510583 | 1.000000 | 0.396753 | 0.419171 | 0.370863 | 0.418373 | 0.408332 | 0.503602 | 0.248762 | 0.344039 | 0.435027 | 0.401815 | 0.289303 | 0.382347 | 0.292383 |
| STAIRWELLS | 0.076234 | 0.006160 | 0.136259 | 0.562646 | 0.447198 | 0.396753 | 1.000000 | 0.505245 | 0.469083 | 0.422580 | 0.599374 | 0.461718 | 0.323344 | 0.385585 | 0.475031 | 0.427633 | 0.299292 | 0.389476 | 0.357707 |
| LAUNDRY_ROOMS | 0.150829 | 0.157234 | 0.275323 | 0.565121 | 0.458404 | 0.419171 | 0.505245 | 1.000000 | 0.381326 | 0.417581 | 0.484054 | 0.490469 | 0.236123 | 0.389613 | 0.465503 | 0.425406 | 0.288619 | 0.351577 | 0.329924 |
| INTERNAL_GUARDS_HANDRAILS | 0.184785 | 0.153452 | 0.233998 | 0.429285 | 0.387223 | 0.370863 | 0.469083 | 0.381326 | 1.000000 | 0.340644 | 0.354027 | 0.411449 | 0.167570 | 0.337728 | 0.359546 | 0.369165 | 0.271873 | 0.362661 | 0.282280 |
| GARBAGE_BIN_STORAGE_AREA | 0.063703 | 0.044288 | 0.116819 | 0.457104 | 0.427268 | 0.418373 | 0.422580 | 0.417581 | 0.340644 | 1.000000 | 0.399484 | 0.412616 | 0.242217 | 0.374213 | 0.478326 | 0.423085 | 0.323684 | 0.353745 | 0.349497 |
| INTERIOR_WALL_CEILING_FLOOR | 0.067745 | 0.030435 | 0.110421 | 0.545580 | 0.488562 | 0.408332 | 0.599374 | 0.484054 | 0.354027 | 0.399484 | 1.000000 | 0.489153 | 0.319970 | 0.393544 | 0.458141 | 0.414459 | 0.303679 | 0.377608 | 0.330263 |
| INTERIOR_LIGHTING_LEVELS | 0.135345 | 0.142126 | 0.215188 | 0.517893 | 0.497602 | 0.503602 | 0.461718 | 0.490469 | 0.411449 | 0.412616 | 0.489153 | 1.000000 | 0.227596 | 0.388030 | 0.461578 | 0.432152 | 0.312306 | 0.394972 | 0.341143 |
| GRAFFITI | -0.026563 | -0.119631 | -0.011284 | 0.281387 | 0.276959 | 0.248762 | 0.323344 | 0.236123 | 0.167570 | 0.242217 | 0.319970 | 0.227596 | 1.000000 | 0.225538 | 0.291444 | 0.244344 | 0.179703 | 0.234188 | 0.191078 |
| EXTERIOR_CLADDING | 0.211636 | 0.117091 | 0.173537 | 0.442738 | 0.459706 | 0.344039 | 0.385585 | 0.389613 | 0.337728 | 0.374213 | 0.393544 | 0.388030 | 0.225538 | 1.000000 | 0.458361 | 0.459323 | 0.413140 | 0.591237 | 0.340671 |
| EXTERIOR_GROUNDS | 0.108594 | 0.088737 | 0.157517 | 0.527497 | 0.521023 | 0.435027 | 0.475031 | 0.465503 | 0.359546 | 0.478326 | 0.458141 | 0.461578 | 0.291444 | 0.458361 | 1.000000 | 0.586838 | 0.352748 | 0.426948 | 0.400288 |
| EXTERIOR_WALKWAYS | 0.140493 | 0.098031 | 0.161837 | 0.479569 | 0.492676 | 0.401815 | 0.427633 | 0.425406 | 0.369165 | 0.423085 | 0.414459 | 0.432152 | 0.244344 | 0.459323 | 0.586838 | 1.000000 | 0.325823 | 0.423368 | 0.399761 |
| BALCONY_GUARDS | 0.028208 | 0.027771 | 0.001725 | 0.343083 | 0.344897 | 0.289303 | 0.299292 | 0.288619 | 0.271873 | 0.323684 | 0.303679 | 0.312306 | 0.179703 | 0.413140 | 0.352748 | 0.325823 | 1.000000 | 0.348556 | 0.251993 |
| WATER_PEN_EXT_BLDG_ELEMENTS | 0.155352 | 0.075040 | 0.130761 | 0.389781 | 0.423843 | 0.382347 | 0.389476 | 0.351577 | 0.362661 | 0.353745 | 0.377608 | 0.394972 | 0.234188 | 0.591237 | 0.426948 | 0.423368 | 0.348556 | 1.000000 | 0.340585 |
| PARKING_AREA | 0.115760 | 0.007841 | 0.116301 | 0.323259 | 0.351325 | 0.292383 | 0.357707 | 0.329924 | 0.282280 | 0.349497 | 0.330263 | 0.341143 | 0.191078 | 0.340671 | 0.400288 | 0.399761 | 0.251993 | 0.340585 | 1.000000 |
# Calculate the correlation matrix
correlation_matrix3 = X3.corr()
# Set up the matplotlib figure
plt.figure(figsize=(10, 8))
# Create a heatmap using seaborn
sns.heatmap(correlation_matrix3, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Heatmap of X3 Variables')
plt.show()
In the heatmap we can observe several correlations above 0.5. We cannot drop every correlated column, so we will remove only 'ENTRANCE_LOBBY', which is strongly related to several other columns, and fit the model again.
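Reading the heatmap by eye works, but the pairs above a threshold can also be listed programmatically. A minimal sketch, shown on a toy DataFrame since the full dataset is not reproduced here; in the notebook the same helper could be applied to X3 (the helper name and toy columns are illustrative):

```python
import pandas as pd

def high_corr_pairs(df, threshold=0.5):
    """Return (col_a, col_b, corr) for pairs whose absolute correlation exceeds threshold."""
    corr = df.corr().abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > threshold:
                pairs.append((cols[i], cols[j], round(corr.iloc[i, j], 3)))
    return pairs

# Toy data: 'a' and 'b' move together, 'c' is unrelated noise.
toy = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 11],
    'c': [5, 1, 4, 2, 3],
})
print(high_corr_pairs(toy))  # only the ('a', 'b', ...) pair crosses 0.5
```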
In this model, variables are suffixed with 4.
eva_columns4 = df_clean[[
'YEAR_BUILT',
'CONFIRMED_STOREYS',
'NO_OF_AREAS_EVALUATED',
'ENTRANCE_DOORS_WINDOWS',
'SECURITY',
'STAIRWELLS',
'LAUNDRY_ROOMS',
'INTERNAL_GUARDS_HANDRAILS',
'GARBAGE_BIN_STORAGE_AREA',
'INTERIOR_WALL_CEILING_FLOOR',
'INTERIOR_LIGHTING_LEVELS',
'GRAFFITI',
'EXTERIOR_CLADDING',
'EXTERIOR_GROUNDS',
'EXTERIOR_WALKWAYS',
'BALCONY_GUARDS',
'WATER_PEN_EXT_BLDG_ELEMENTS',
'PARKING_AREA']]
X4 = eva_columns4
y4 = df_clean['SCORE']
X4_withconstant4 = sm.add_constant(X4)
# 1. Instantiate Model
myregression4 = sm.OLS(y4, X4_withconstant4)
# 2. Fit Model (this returns a separate object with the parameters)
myregression_results4 = myregression4.fit()
# Looking at the summary
myregression_results4.summary()
| Dep. Variable: | SCORE | R-squared: | 0.984 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.984 |
| Method: | Least Squares | F-statistic: | 3.896e+04 |
| Date: | Wed, 06 Sep 2023 | Prob (F-statistic): | 0.00 |
| Time: | 21:24:41 | Log-Likelihood: | -18741. |
| No. Observations: | 11152 | AIC: | 3.752e+04 |
| Df Residuals: | 11133 | BIC: | 3.766e+04 |
| Df Model: | 18 | ||
| Covariance Type: | nonrobust |
| coef | std err | t | P>|t| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| const | -4.2248 | 1.499 | -2.818 | 0.005 | -7.164 | -1.286 |
| YEAR_BUILT | 0.0021 | 0.001 | 2.659 | 0.008 | 0.001 | 0.004 |
| CONFIRMED_STOREYS | 0.0247 | 0.003 | 9.597 | 0.000 | 0.020 | 0.030 |
| NO_OF_AREAS_EVALUATED | -0.0579 | 0.010 | -5.567 | 0.000 | -0.078 | -0.038 |
| ENTRANCE_DOORS_WINDOWS | 1.5276 | 0.022 | 68.804 | 0.000 | 1.484 | 1.571 |
| SECURITY | 1.3880 | 0.018 | 76.414 | 0.000 | 1.352 | 1.424 |
| STAIRWELLS | 1.6038 | 0.022 | 71.808 | 0.000 | 1.560 | 1.648 |
| LAUNDRY_ROOMS | 1.5422 | 0.021 | 73.750 | 0.000 | 1.501 | 1.583 |
| INTERNAL_GUARDS_HANDRAILS | 1.3464 | 0.018 | 73.963 | 0.000 | 1.311 | 1.382 |
| GARBAGE_BIN_STORAGE_AREA | 1.4362 | 0.020 | 72.767 | 0.000 | 1.397 | 1.475 |
| INTERIOR_WALL_CEILING_FLOOR | 1.4963 | 0.022 | 67.063 | 0.000 | 1.453 | 1.540 |
| INTERIOR_LIGHTING_LEVELS | 1.3809 | 0.019 | 72.935 | 0.000 | 1.344 | 1.418 |
| GRAFFITI | 1.2443 | 0.018 | 69.386 | 0.000 | 1.209 | 1.279 |
| EXTERIOR_CLADDING | 1.3158 | 0.024 | 54.941 | 0.000 | 1.269 | 1.363 |
| EXTERIOR_GROUNDS | 1.4684 | 0.023 | 63.247 | 0.000 | 1.423 | 1.514 |
| EXTERIOR_WALKWAYS | 1.2764 | 0.023 | 56.497 | 0.000 | 1.232 | 1.321 |
| BALCONY_GUARDS | 0.9960 | 0.021 | 48.461 | 0.000 | 0.956 | 1.036 |
| WATER_PEN_EXT_BLDG_ELEMENTS | 1.1711 | 0.022 | 52.740 | 0.000 | 1.128 | 1.215 |
| PARKING_AREA | 1.0413 | 0.020 | 52.750 | 0.000 | 1.003 | 1.080 |
| Omnibus: | 238.926 | Durbin-Watson: | 1.828 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 469.268 |
| Skew: | 0.129 | Prob(JB): | 1.26e-102 |
| Kurtosis: | 3.972 | Cond. No. | 2.39e+05 |
The linear regression model fits extremely closely, with an R-squared of 0.984. However, the condition number is large (2.39e+05), which may indicate strong multicollinearity.
Judging from these results, the evaluation score appears to be computed from the evaluation indicators by a fixed formula, which the linear model has essentially recovered. That is why we should try a different kind of model.
In this model, variables are suffixed with 5.
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
# Assign x and y data
X5 = df_clean.drop(['SCORE', 'RESULTS_OF_SCORE'], axis=1)
y5 = df_clean['SCORE']
# We have object columns like 'PROPERTY_TYPE', 'WARDNAME' , etc.
X5_encoded = pd.get_dummies(X5, columns=['PROPERTY_TYPE', 'WARDNAME', 'GRID'])
# Split test and train data.
X5_train, X5_test, y5_train, y5_test = train_test_split(X5_encoded, y5, test_size=0.2, random_state=42)
rf_model5 = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model5.fit(X5_train, y5_train)
y5_pred = rf_model5.predict(X5_test)
mae5 = mean_absolute_error(y5_test, y5_pred)
mse5 = mean_squared_error(y5_test, y5_pred)
rmse5 = np.sqrt(mse5)
r2_5 = r2_score(y5_test, y5_pred)
print("Mean Absolute Error:", mae5)
print("Mean Squared Error:", mse5)
print("Root Mean Squared Error:", rmse5)
print("R square:", r2_5)
Mean Absolute Error: 1.5330793366203497
Mean Squared Error: 4.212218735992828
Root Mean Squared Error: 2.0523690545301125
R square: 0.9605444873189599
The Mean Squared Error (MSE) from the model evaluation is 4.21. Since SCORE ranges from 0 to 100, an MSE of 4.21 is relatively low, indicating that the model's predictions are reasonably close to the actual values.
Additionally, the R-squared is 0.96, which is a strong result. There may still be a better-tuned model, so let's look for it.
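The manual loop below keeps the search transparent. As a reference point, scikit-learn's GridSearchCV runs the same kind of search with cross-validation built in; this sketch uses synthetic data as a stand-in for X5_train and y5_train, and the grid values are illustrative:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the apartment data.
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=5.0, random_state=42)

grid = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid={'n_estimators': [50, 100, 150]},
    scoring='neg_mean_squared_error',
    cv=3,
)
grid.fit(X_demo, y_demo)
print(grid.best_params_)   # the n_estimators value with the best mean CV score
print(-grid.best_score_)   # corresponding mean cross-validated MSE
```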
n_estimators_range = range(10, 211, 25)
mse_scores = []
r2_scores = []
for n_estimators in n_estimators_range:
    # build model
    rf_model = RandomForestRegressor(n_estimators=n_estimators, random_state=42)
    # fit model
    rf_model.fit(X5_train, y5_train)
    # evaluate
    y_pred = rf_model.predict(X5_test)
    mse = mean_squared_error(y5_test, y_pred)
    r2 = r2_score(y5_test, y_pred)
    mse_scores.append(mse)
    r2_scores.append(r2)
    print('* The number of n_estimators:', n_estimators)
    print('Mean Squared Error:', mse)
    print('R square:', r2)
    print('')
* The number of n_estimators: 10
Mean Squared Error: 5.425235320484088
R square: 0.949182258946832

* The number of n_estimators: 35
Mean Squared Error: 4.384696530337818
R square: 0.958928901750284

* The number of n_estimators: 60
Mean Squared Error: 4.2562310125006215
R square: 0.9601322278797632

* The number of n_estimators: 85
Mean Squared Error: 4.228498027945325
R square: 0.960392000506112

* The number of n_estimators: 110
Mean Squared Error: 4.237120292201176
R square: 0.9603112364532438

* The number of n_estimators: 135
Mean Squared Error: 4.193067359239644
R square: 0.960723876718159

* The number of n_estimators: 160
Mean Squared Error: 4.197450239522635
R square: 0.9606828226325514

* The number of n_estimators: 185
Mean Squared Error: 4.2190710812087735
R square: 0.9604803019547831

* The number of n_estimators: 210
Mean Squared Error: 4.196148905700035
R square: 0.9606950121213597
plt.figure(figsize=(12, 6))
# MSE
plt.subplot(1, 2, 1)
plt.plot(n_estimators_range, mse_scores, marker='o', linestyle='-', color='blue')
plt.title('Mean Squared Error (MSE) vs. n_estimators for Random Forest Model')
plt.xlabel('n_estimators')
plt.ylabel('Mean Squared Error (MSE)')
plt.grid(True)
# R-squared
plt.subplot(1, 2, 2)
plt.plot(n_estimators_range, r2_scores, marker='o', linestyle='-', color='green')
plt.title('R-squared vs. n_estimators for Random Forest Model')
plt.xlabel('n_estimators')
plt.ylabel('R-squared')
plt.grid(True)
plt.tight_layout()
plt.show()
Let's take a closer look at the range between 125 and 175.
n_estimators_range = range(125, 176, 10)
mse_scores = []
r2_scores = []
for n_estimators in n_estimators_range:
    # build model
    rf_model = RandomForestRegressor(n_estimators=n_estimators, random_state=42)
    # fit model
    rf_model.fit(X5_train, y5_train)
    # evaluate
    y_pred = rf_model.predict(X5_test)
    mse = mean_squared_error(y5_test, y_pred)
    r2 = r2_score(y5_test, y_pred)
    mse_scores.append(mse)
    r2_scores.append(r2)
    print('* The number of n_estimators:', n_estimators)
    print('Mean Squared Error:', mse)
    print('R square:', r2)
    print('')
* The number of n_estimators: 125
Mean Squared Error: 4.20691022142537
R square: 0.960594211746147

* The number of n_estimators: 135
Mean Squared Error: 4.193067359239644
R square: 0.960723876718159

* The number of n_estimators: 145
Mean Squared Error: 4.188086305229894
R square: 0.9607705338487509

* The number of n_estimators: 155
Mean Squared Error: 4.194283651377267
R square: 0.9607124838079453

* The number of n_estimators: 165
Mean Squared Error: 4.199505869172142
R square: 0.9606635677156569

* The number of n_estimators: 175
Mean Squared Error: 4.203051034129474
R square: 0.9606303604418429
In this model, the model is suffixed with 6 and reuses the variables suffixed with 5.
145 seems the most appropriate n_estimators. Let's build a model with this parameter and evaluate it.
plt.figure(figsize=(12, 6))
# MSE
plt.subplot(1, 2, 1)
plt.plot(n_estimators_range, mse_scores, marker='o', linestyle='-', color='blue')
plt.title('Mean Squared Error (MSE) vs. n_estimators for Random Forest Model')
plt.xlabel('n_estimators')
plt.ylabel('Mean Squared Error (MSE)')
plt.grid(True)
# R-squared
plt.subplot(1, 2, 2)
plt.plot(n_estimators_range, r2_scores, marker='o', linestyle='-', color='green')
plt.title('R-squared vs. n_estimators for Random Forest Model')
plt.xlabel('n_estimators')
plt.ylabel('R-squared')
plt.grid(True)
plt.tight_layout()
plt.show()
rf_model6 = RandomForestRegressor(n_estimators=145, random_state=42)
rf_model6.fit(X5_train, y5_train)
y6_pred = rf_model6.predict(X5_test)
mae6 = mean_absolute_error(y5_test, y6_pred)
mse6 = mean_squared_error(y5_test, y6_pred)
rmse6 = np.sqrt(mse6)
r2_6 = r2_score(y5_test, y6_pred)
print("Mean Absolute Error:", mae6)
print("Mean Squared Error:", mse6)
print("Root Mean Squared Error:", rmse6)
print("R square:", r2_6)
Mean Absolute Error: 1.5289262585202243
Mean Squared Error: 4.188086305229894
Root Mean Squared Error: 2.046481445122309
R square: 0.9607705338487509
For this regression task, random forest regression is the more suitable choice, since the linear model appears merely to recover the scoring formula. With an R-squared of 0.96 and a low Mean Squared Error of 4.19, it performs very well.
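One caveat: both scores come from a single train/test split. Cross-validation gives a steadier estimate before declaring a winner. A sketch, again with synthetic stand-in data rather than X5_encoded:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the apartment data.
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=5.0, random_state=0)

# Five-fold cross-validated R-squared for the random forest.
scores = cross_val_score(
    RandomForestRegressor(n_estimators=50, random_state=0),
    X_demo, y_demo, cv=5, scoring='r2',
)
print(scores.round(3))
print(scores.mean().round(3))
```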
From here we build a simple demo model using only six important columns; it is intended for demonstration.
Let's find the six most important columns.
feature_importances = rf_model6.feature_importances_
feature_names = X5_train.columns
# top 20 features
top_20_indices = feature_importances.argsort()[-20:][::-1]
# feature names and importances
top_20_features = [feature_names[i] for i in top_20_indices]
top_20_importances = [feature_importances[i] for i in top_20_indices]
top_20_features.reverse()
top_20_importances.reverse()
plt.figure(figsize=(10, 8))
plt.barh(top_20_features, top_20_importances)
plt.xlabel("Feature Importance")
plt.ylabel("Feature")
plt.title("Top 20 Random Forest Model Feature Importance")
plt.show()
The six most important columns are:
'ENTRANCE_LOBBY', 'EXTERIOR_GROUNDS', 'STAIRWELLS', 'INTERIOR_WALL_CEILING_FLOOR', 'INTERIOR_LIGHTING_LEVELS', 'WATER_PEN_EXT_BLDG_ELEMENTS'
# Assign x and y data
X7 = df_clean[['ENTRANCE_LOBBY', 'EXTERIOR_GROUNDS', 'STAIRWELLS', 'INTERIOR_WALL_CEILING_FLOOR', 'INTERIOR_LIGHTING_LEVELS', 'WATER_PEN_EXT_BLDG_ELEMENTS']]
y7 = df_clean['SCORE']
# Split test and train data.
X7_train, X7_test, y7_train, y7_test = train_test_split(X7, y7, test_size=0.3, random_state=42)
rf_model7 = RandomForestRegressor(n_estimators=145, random_state=42)
rf_model7.fit(X7_train, y7_train)
y7_pred = rf_model7.predict(X7_test)
mae7 = mean_absolute_error(y7_test, y7_pred)
mse7 = mean_squared_error(y7_test, y7_pred)
rmse7 = np.sqrt(mse7)
r2_7 = r2_score(y7_test, y7_pred)
print("Mean Absolute Error:", mae7)
print("Mean Squared Error:", mse7)
print("Root Mean Squared Error:", rmse7)
print("R square:", r2_7)
Mean Absolute Error: 2.6637812073054423
Mean Squared Error: 11.581661512335685
Root Mean Squared Error: 3.403184025634771
R square: 0.8912770053909287
n_estimators_range = range(10, 211, 25)
mse_scores = []
r2_scores = []
for n_estimators in n_estimators_range:
    # build model
    rf_model = RandomForestRegressor(n_estimators=n_estimators, random_state=42)
    # fit model
    rf_model.fit(X7_train, y7_train)
    # evaluate
    y_pred = rf_model.predict(X7_test)
    mse = mean_squared_error(y7_test, y_pred)
    r2 = r2_score(y7_test, y_pred)
    mse_scores.append(mse)
    r2_scores.append(r2)
    print('* The number of n_estimators:', n_estimators)
    print('Mean Squared Error:', mse)
    print('R square:', r2)
    print('')
* The number of n_estimators: 10
Mean Squared Error: 11.919804656364244
R square: 0.8881026823297553

* The number of n_estimators: 35
Mean Squared Error: 11.697772502541822
R square: 0.8901870119950078

* The number of n_estimators: 60
Mean Squared Error: 11.635432944103501
R square: 0.8907722253919632

* The number of n_estimators: 85
Mean Squared Error: 11.602684845435464
R square: 0.8910796485843222

* The number of n_estimators: 110
Mean Squared Error: 11.603302711354798
R square: 0.8910738483601535

* The number of n_estimators: 135
Mean Squared Error: 11.573660275985372
R square: 0.8913521170988347

* The number of n_estimators: 160
Mean Squared Error: 11.57595419446467
R square: 0.8913305829099627

* The number of n_estimators: 185
Mean Squared Error: 11.573992568441765
R square: 0.891348997699178

* The number of n_estimators: 210
Mean Squared Error: 11.577904987297
R square: 0.8913122698174615
plt.figure(figsize=(12, 6))
# MSE
plt.subplot(1, 2, 1)
plt.plot(n_estimators_range, mse_scores, marker='o', linestyle='-', color='blue')
plt.title('Mean Squared Error (MSE) vs. n_estimators for Random Forest Model')
plt.xlabel('n_estimators')
plt.ylabel('Mean Squared Error (MSE)')
plt.grid(True)
# R-squared
plt.subplot(1, 2, 2)
plt.plot(n_estimators_range, r2_scores, marker='o', linestyle='-', color='green')
plt.title('R-squared vs. n_estimators for Random Forest Model')
plt.xlabel('n_estimators')
plt.ylabel('R-squared')
plt.grid(True)
plt.tight_layout()
plt.show()
If we predict using only these six columns, 128 appears to be the best n_estimators. Let's build our demo model.
rf_model8 = RandomForestRegressor(n_estimators=128, random_state=42)
rf_model8.fit(X7_train, y7_train)
y8_pred = rf_model8.predict(X7_test)
mae8 = mean_absolute_error(y7_test, y8_pred)
mse8 = mean_squared_error(y7_test, y8_pred)
rmse8 = np.sqrt(mse8)
r2_8 = r2_score(y7_test, y8_pred)
print("Mean Absolute Error:", mae8)
print("Mean Squared Error:", mse8)
print("Root Mean Squared Error:", rmse8)
print("R square:", r2_8)
Mean Absolute Error: 2.663881573123019
Mean Squared Error: 11.587857661382285
Root Mean Squared Error: 3.404094249779563
R square: 0.8912188389630225
Save the model with pickle.
import pickle
# Save the model to a file
with open('../models/rf_model8.pkl', 'wb') as file:
    pickle.dump(rf_model8, file)
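To reuse the saved model later (for instance in the demo), it is loaded back with pickle.load. A self-contained sketch using a throwaway model and a temp file, since the notebook's rf_model8 and its '../models/rf_model8.pkl' path are not available here:

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Throwaway stand-in for rf_model8.
X_demo, y_demo = make_regression(n_samples=100, n_features=6, random_state=42)
model = RandomForestRegressor(n_estimators=10, random_state=42).fit(X_demo, y_demo)

path = os.path.join(tempfile.gettempdir(), 'rf_demo.pkl')
with open(path, 'wb') as file:
    pickle.dump(model, file)

with open(path, 'rb') as file:
    loaded = pickle.load(file)

# The round-tripped model reproduces the original predictions exactly.
print((loaded.predict(X_demo) == model.predict(X_demo)).all())
```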
To use classification models, we should check the target feature: the target variable must be categorical.
The 'RESULTS_OF_SCORE' column is our target. Let's explore it first.
agg_func_math = {
    'SCORE': ['count', 'mean', 'median', 'min', 'max', 'std', 'var']
}
df_clean.groupby(['RESULTS_OF_SCORE']).agg(agg_func_math).round(2)
| SCORE | |||||||
|---|---|---|---|---|---|---|---|
| count | mean | median | min | max | std | var | |
| RESULTS_OF_SCORE | |||||||
| Building Audit | 98 | 45.79 | 47.0 | 20 | 50 | 4.36 | 19.02 |
| Evaluation needs to be conducted in 1 year | 2476 | 60.43 | 61.0 | 51 | 65 | 3.67 | 13.50 |
| Evaluation needs to be conducted in 2 years | 7077 | 75.43 | 76.0 | 66 | 85 | 5.42 | 29.37 |
| Evaluation needs to be conducted in 3 years | 1501 | 90.21 | 89.0 | 86 | 100 | 3.51 | 12.30 |
We have 4 levels in this column.
The highest level is 'Evaluation needs to be conducted in 3 years', with a mean of 90.21; scores from 86 to 100 belong here, and the median is 89.
'Evaluation needs to be conducted in 2 years' accounts for the largest group at 7,077 buildings. Its mean is 75.43, its median 76, and its variance is the largest at 29.37.
'Building Audit' is the category we should pay the most attention to, because our goal is to make predictions that help buildings avoid this outcome. Scores from 20 to 50 fall here; the mean is 45.79 and the median is 47.0.
In machine learning every variable must be represented numerically, so we will create a new column.
# create a new column to encode the categorical result as numbers
df_clean['RESULTS_CODE'] = df_clean['RESULTS_OF_SCORE']
# assign numeric codes based on 'RESULTS_OF_SCORE'
df_clean.loc[df_clean['RESULTS_OF_SCORE'] == 'Building Audit', 'RESULTS_CODE'] = 0
df_clean.loc[df_clean['RESULTS_OF_SCORE'] == 'Evaluation needs to be conducted in 1 year', 'RESULTS_CODE'] = 1
df_clean.loc[df_clean['RESULTS_OF_SCORE'] == 'Evaluation needs to be conducted in 2 years', 'RESULTS_CODE'] = 2
df_clean.loc[df_clean['RESULTS_OF_SCORE'] == 'Evaluation needs to be conducted in 3 years', 'RESULTS_CODE'] = 3
df_clean['RESULTS_CODE'] = df_clean['RESULTS_CODE'].astype(int)
df_clean['RESULTS_CODE'].value_counts()
RESULTS_CODE
2    7077
1    2476
3    1501
0      98
Name: count, dtype: int64
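The four .loc assignments above can also be collapsed into a single map call with an explicit dictionary, which leaves NaN behind if an unexpected label ever appears. A sketch on a toy Series:

```python
import pandas as pd

code_map = {
    'Building Audit': 0,
    'Evaluation needs to be conducted in 1 year': 1,
    'Evaluation needs to be conducted in 2 years': 2,
    'Evaluation needs to be conducted in 3 years': 3,
}

labels = pd.Series([
    'Building Audit',
    'Evaluation needs to be conducted in 2 years',
    'Evaluation needs to be conducted in 1 year',
])
codes = labels.map(code_map)
print(codes.tolist())  # [0, 2, 1]
```

In the notebook the equivalent would be df_clean['RESULTS_OF_SCORE'].map(code_map).astype(int).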
From here, classification-model variables will use name suffixes starting in the 20s.
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.tree import plot_tree
In this model, variables are suffixed with 21.
X21 = df_clean.drop(['SCORE', 'RESULTS_OF_SCORE', 'RESULTS_CODE'], axis=1) # Features
y21 = df_clean['RESULTS_CODE'] # Target variable
X21_encoded = pd.get_dummies(X21, columns=['PROPERTY_TYPE', 'WARDNAME', 'GRID'])
X_train21, X_test21, y_train21, y_test21 = train_test_split(X21_encoded, y21, test_size=0.2, random_state=42)
dt_model21 = DecisionTreeClassifier(random_state=42, max_depth=3, min_samples_leaf=5)
dt_model21.fit(X_train21, y_train21)
DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=42)
< RESULTS_OF_SCORE >
0 : Building Audit
1 : Evaluation needs to be conducted in 1 year
2 : Evaluation needs to be conducted in 2 years
3 : Evaluation needs to be conducted in 3 years
y_pred21 = dt_model21.predict(X_test21)
accuracy = accuracy_score(y_test21, y_pred21)
report = classification_report(y_test21, y_pred21)
print(f"Accuracy: {accuracy}")
print("Report:\n", report)
Accuracy: 0.8014343343792022
Report:
precision recall f1-score support
0 0.00 0.00 0.00 17
1 0.77 0.73 0.75 489
2 0.81 0.92 0.86 1435
3 0.80 0.39 0.53 290
accuracy 0.80 2231
macro avg 0.59 0.51 0.53 2231
weighted avg 0.79 0.80 0.79 2231
percentage_0 = len(df_clean[df_clean['RESULTS_OF_SCORE'] == 'Building Audit'])/df_clean['RESULTS_OF_SCORE'].count()
print('percentage of Building Audit(0):', percentage_0)
percentage of Building Audit(0): 0.008787661406025825
The precision for 'Building Audit' (0), the class that matters most to us, came out at 0.00. The reason is that its share of the data is far too small, at 0.88%. That is why we need to upsample.
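The share of every class can be read off in one call instead of computing each ratio by hand. A sketch reconstructing the class counts reported in the RESULTS_CODE value_counts output above:

```python
import pandas as pd

# Class counts taken from the RESULTS_CODE value_counts above.
targets = pd.Series([2] * 7077 + [1] * 2476 + [3] * 1501 + [0] * 98)

shares = targets.value_counts(normalize=True)
print(shares.round(4))  # class 0 ('Building Audit') is under 1% of the data
```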
from sklearn.utils import resample
In this model, variables are suffixed with 22.
X22 = df_clean.drop(['SCORE', 'RESULTS_OF_SCORE', 'RESULTS_CODE'], axis=1) # Features
y22 = df_clean['RESULTS_CODE'] # Target variable
X22_encoded = pd.get_dummies(X22, columns=['PROPERTY_TYPE', 'WARDNAME', 'GRID'])
X_train22, X_test22, y_train22, y_test22 = train_test_split(X22_encoded, y22, test_size=0.2, random_state=42)
In addition, the target classes are severely imbalanced. Let's address this as well and try again.
from imblearn.over_sampling import SMOTE
# SMOTE
smote = SMOTE(random_state=42)
print('The number of class 0 (Building Audit) before:', X_train22[y_train22 == 0].shape[0])
print('The number of class 1 before:', X_train22[y_train22 == 1].shape[0])
print('The number of class 2 before:', X_train22[y_train22 == 2].shape[0])
print('The number of class 3 before:', X_train22[y_train22 == 3].shape[0])
# resampling
X_train_resampled22, y_train_resampled22 = smote.fit_resample(X_train22, y_train22)
print()
print('The number of class 0 (Building Audit) after:', X_train_resampled22[y_train_resampled22 == 0].shape[0])
print('The number of class 1 after:', X_train_resampled22[y_train_resampled22 == 1].shape[0])
print('The number of class 2 after:', X_train_resampled22[y_train_resampled22 == 2].shape[0])
print('The number of class 3 after:', X_train_resampled22[y_train_resampled22 == 3].shape[0])
The number of class 0 (Building Audit) before: 81
The number of class 1 before: 1987
The number of class 2 before: 5642
The number of class 3 before: 1211

The number of class 0 (Building Audit) after: 5642
The number of class 1 after: 5642
The number of class 2 after: 5642
The number of class 3 after: 5642
dt_model22 = DecisionTreeClassifier(random_state=42, max_depth=3, min_samples_leaf=5)
dt_model22.fit(X_train_resampled22, y_train_resampled22)
DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=42)
y_pred22 = dt_model22.predict(X_test22)
accuracy22 = accuracy_score(y_test22, y_pred22)
report22 = classification_report(y_test22, y_pred22)
print(f"Accuracy: {accuracy22}")
print("Report:\n", report22)
Accuracy: 0.7709547288211565
Report:
precision recall f1-score support
0 0.34 0.71 0.46 17
1 0.68 0.76 0.72 489
2 0.86 0.79 0.82 1435
3 0.60 0.71 0.65 290
accuracy 0.77 2231
macro avg 0.62 0.74 0.66 2231
weighted avg 0.79 0.77 0.78 2231
plt.figure(figsize=(10, 6))
plot_tree(dt_model22,
feature_names=dt_model22.feature_names_in_,
rounded=True,
impurity=False,
filled=True,
fontsize=11);
In the cell above we set max_depth to 3. However, the accuracy changes with this setting, so we should find the best max_depth for the decision tree.
train_acc = []
test_acc = []
# loop for finding best max_depth
for max_depth in range(1, 12):
    # initialize model
    dt_model = DecisionTreeClassifier(max_depth=max_depth, random_state=42)
    # fit model
    dt_model.fit(X_train_resampled22, y_train_resampled22)
    # score model
    print('* The number of max_depth :', max_depth)
    print('* The number of actual depth:', dt_model.get_depth())
    print('Test accuracy:', dt_model.score(X_test22, y_test22))
    print('Train accuracy:', dt_model.score(X_train_resampled22, y_train_resampled22))
    print('')
    test_acc.append(dt_model.score(X_test22, y_test22))
    train_acc.append(dt_model.score(X_train_resampled22, y_train_resampled22))
* The number of max_depth : 1
* The number of actual depth: 1
Test accuracy: 0.13536530703720304
Train accuracy: 0.4869727047146402

* The number of max_depth : 2
* The number of actual depth: 2
Test accuracy: 0.3290004482294935
Train accuracy: 0.6985554767812833

* The number of max_depth : 3
* The number of actual depth: 3
Test accuracy: 0.7709547288211565
Train accuracy: 0.7706043956043956

* The number of max_depth : 4
* The number of actual depth: 4
Test accuracy: 0.7543702375616316
Train accuracy: 0.8222704714640199

* The number of max_depth : 5
* The number of actual depth: 5
Test accuracy: 0.770058269834155
Train accuracy: 0.8528890464374336

* The number of max_depth : 6
* The number of actual depth: 6
Test accuracy: 0.8095024652622143
Train accuracy: 0.8663151364764268

* The number of max_depth : 7
* The number of actual depth: 7
Test accuracy: 0.8242940385477364
Train accuracy: 0.8794753633463311

* The number of max_depth : 8
* The number of actual depth: 8
Test accuracy: 0.8072613177947109
Train accuracy: 0.8974211272598369

* The number of max_depth : 9
* The number of actual depth: 9
Test accuracy: 0.8103989242492156
Train accuracy: 0.912974122651542

* The number of max_depth : 10
* The number of actual depth: 10
Test accuracy: 0.8144329896907216
Train accuracy: 0.9269319390287132

* The number of max_depth : 11
* The number of actual depth: 11
Test accuracy: 0.822052891080233
Train accuracy: 0.9382754342431762
We will draw a plot to make it easy to recognize at a glance.
plt.figure(figsize=(10, 5))
plt.plot(range(1, 12), train_acc, marker="o", label="train accuracy")
plt.plot(range(1, 12), test_acc, marker="o", label="test accuracy")
plt.xlabel("Max Depth")
plt.ylabel("Accuracy")
plt.legend()
plt.show()
Let's try max_depth as six (6).
dt_model23 = DecisionTreeClassifier(random_state=42, max_depth=6, min_samples_leaf=5)
dt_model23.fit(X_train_resampled22, y_train_resampled22)
y_pred23 = dt_model23.predict(X_test22)
accuracy23 = accuracy_score(y_test22, y_pred23)
report23 = classification_report(y_test22, y_pred23)
print(f"Accuracy: {accuracy23}")
print("Report:\n", report23)
Accuracy: 0.8095024652622143
Report:
precision recall f1-score support
0 0.86 0.71 0.77 17
1 0.76 0.84 0.80 489
2 0.92 0.78 0.84 1435
3 0.58 0.92 0.71 290
accuracy 0.81 2231
macro avg 0.78 0.81 0.78 2231
weighted avg 0.84 0.81 0.81 2231
Let's try max_depth as seven (7).
dt_model24 = DecisionTreeClassifier(random_state=42, max_depth=7, min_samples_leaf=5)
dt_model24.fit(X_train_resampled22, y_train_resampled22)
y_pred24 = dt_model24.predict(X_test22)
accuracy24 = accuracy_score(y_test22, y_pred24)
report24 = classification_report(y_test22, y_pred24)
print(f"Accuracy: {accuracy24}")
print("Report:\n", report24)
Accuracy: 0.8256387270282385
Report:
precision recall f1-score support
0 0.92 0.71 0.80 17
1 0.84 0.82 0.83 489
2 0.91 0.81 0.86 1435
3 0.57 0.91 0.70 290
accuracy 0.83 2231
macro avg 0.81 0.81 0.80 2231
weighted avg 0.85 0.83 0.83 2231
plt.figure(figsize=(10, 6))
plot_tree(dt_model23,
feature_names=dt_model23.feature_names_in_,
rounded=True,
impurity=False,
filled=True);
Precision: Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. It measures the accuracy of positive predictions.
Recall: Recall is the ratio of correctly predicted positive observations to all actual positive observations. It measures the model's ability to identify all relevant instances.
F1-Score: The F1-Score is the harmonic mean of precision and recall. It provides a balance between precision and recall and is useful when there is an uneven class distribution.
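These definitions can be verified by hand against scikit-learn on a tiny binary example (the arrays below are made up purely for illustration):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]

# By hand: TP = 3, FP = 1, FN = 1.
precision = 3 / (3 + 1)                              # TP / (TP + FP) = 0.75
recall = 3 / (3 + 1)                                 # TP / (TP + FN) = 0.75
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean = 0.75

# scikit-learn agrees with the hand calculation.
print(precision_score(y_true, y_pred))  # 0.75
print(recall_score(y_true, y_pred))     # 0.75
print(f1_score(y_true, y_pred))         # 0.75
```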
With max_depth set to 7, the 'Building Audit' class (0) was predicted well: precision 0.92, recall 0.71, and f1-score 0.80.
From here we build a simple demo model using only eight important columns; it is intended for demonstration.
importance = dt_model24.feature_importances_
feature_names = X_train_resampled22.columns
# importance
feature_importance = sorted(zip(importance, feature_names), reverse=True)
# top 20
top_20_features = feature_importance[:20]
sorted_features = [feature for _, feature in top_20_features]
sorted_importance = [importance for importance, _ in top_20_features]
# plot
plt.figure(figsize=(10, 6))
plt.barh(sorted_features, sorted_importance)
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.title('Top 20 Feature Importance in Decision Tree Model')
plt.gca().invert_yaxis()  # display features in descending order of importance
plt.show()
X23 = df_clean[['ENTRANCE_DOORS_WINDOWS',
'ENTRANCE_LOBBY',
'INTERIOR_WALL_CEILING_FLOOR',
'WATER_PEN_EXT_BLDG_ELEMENTS',
'EXTERIOR_GROUNDS',
'INTERIOR_LIGHTING_LEVELS',
'STAIRWELLS',
'SECURITY']] # Features
y23 = df_clean['RESULTS_CODE'] # Target variable
X_train23, X_test23, y_train23, y_test23 = train_test_split(X23, y23, test_size=0.3, random_state=42)
# SMOTE
smote = SMOTE(random_state=42)
print('The number of class 0 (Building Audit) before:', X_train23[y_train23 == 0].shape[0])
print('The number of class 1 before:', X_train23[y_train23 == 1].shape[0])
print('The number of class 2 before:', X_train23[y_train23 == 2].shape[0])
print('The number of class 3 before:', X_train23[y_train23 == 3].shape[0])
# resampling
X_train_resampled23, y_train_resampled23 = smote.fit_resample(X_train23, y_train23)
print()
print('The numbers of Building Audit(0) after:', X_train_resampled23[y_train_resampled23 == 0].shape[0])
print('The numbers of Building Audit(1) after:', X_train_resampled23[y_train_resampled23 == 1].shape[0])
print('The numbers of Building Audit(2) after:', X_train_resampled23[y_train_resampled23 == 2].shape[0])
print('The numbers of Building Audit(3) after:', X_train_resampled23[y_train_resampled23 == 3].shape[0])
The numbers of Building Audit(0) before: 72
The numbers of Building Audit(1) before: 1745
The numbers of Building Audit(2) before: 4924
The numbers of Building Audit(3) before: 1065

The numbers of Building Audit(0) after: 4924
The numbers of Building Audit(1) after: 4924
The numbers of Building Audit(2) after: 4924
The numbers of Building Audit(3) after: 4924
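Conceptually, SMOTE generates these extra minority samples by interpolating between a minority point and one of its minority-class neighbors. A minimal sketch of that idea (our own simplification — imblearn's actual implementation interpolates toward k-nearest neighbors):

```python
import numpy as np

rng = np.random.default_rng(42)

def smote_like(minority: np.ndarray, n_new: int) -> np.ndarray:
    """Generate n_new synthetic points by interpolating between
    random pairs of existing minority samples (simplified SMOTE idea)."""
    synthetic = []
    for _ in range(n_new):
        i, j = rng.choice(len(minority), size=2, replace=False)
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(minority[i] + lam * (minority[j] - minority[i]))
    return np.array(synthetic)

# Three made-up minority-class points in 2-D feature space.
minority = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 1.0]])
new_points = smote_like(minority, n_new=5)
print(new_points.shape)  # -> (5, 2)
```

Because each synthetic point lies on a segment between two real minority points, the new samples stay inside the region the minority class already occupies.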
dt_model23 = DecisionTreeClassifier(random_state=42, max_depth=3, min_samples_leaf=5)
dt_model23.fit(X_train_resampled23, y_train_resampled23)
DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=42)
y_pred23 = dt_model23.predict(X_test23)
accuracy23 = accuracy_score(y_test23, y_pred23)
report23 = classification_report(y_test23, y_pred23)
print(f"Accuracy: {accuracy23}")
print("Report:\n", report23)
Accuracy: 0.7429766885833832
Report:
precision recall f1-score support
0 0.35 0.88 0.51 26
1 0.60 0.83 0.70 731
2 0.88 0.72 0.79 2153
3 0.60 0.70 0.65 436
accuracy 0.74 3346
macro avg 0.61 0.78 0.66 3346
weighted avg 0.78 0.74 0.75 3346
train_acc = []
test_acc = []
# loop for finding best max_depth
for max_depth in range(1,12):
# initialize model
dt_model = DecisionTreeClassifier(max_depth=max_depth, random_state=42)
# fit model
dt_model.fit(X_train_resampled23, y_train_resampled23)
# score model
print('* The number of max_depth :', max_depth)
print('* The number of actual depth:', dt_model.get_depth())
print('Test accuracy:', dt_model.score(X_test23, y_test23))
print('Train accuracy', dt_model.score(X_train_resampled23, y_train_resampled23))
print('')
test_acc.append(dt_model.score(X_test23, y_test23))
train_acc.append(dt_model.score(X_train_resampled23, y_train_resampled23))
* The number of max_depth : 1
* The number of actual depth: 1
Test accuracy: 0.13777644949193066
Train accuracy 0.497410641754671

* The number of max_depth : 2
* The number of actual depth: 2
Test accuracy: 0.3302450687387926
Train accuracy 0.6746547522339561

* The number of max_depth : 3
* The number of actual depth: 3
Test accuracy: 0.7429766885833832
Train accuracy 0.7944252640129975

* The number of max_depth : 4
* The number of actual depth: 4
Test accuracy: 0.7193664076509265
Train accuracy 0.8420999187652315

* The number of max_depth : 5
* The number of actual depth: 5
Test accuracy: 0.7393903167961745
Train accuracy 0.8623578391551584

* The number of max_depth : 6
* The number of actual depth: 6
Test accuracy: 0.7576210400478183
Train accuracy 0.8803310316815597

* The number of max_depth : 7
* The number of actual depth: 7
Test accuracy: 0.7904961147638971
Train accuracy 0.8956133225020309

* The number of max_depth : 8
* The number of actual depth: 8
Test accuracy: 0.7979677226539151
Train accuracy 0.9073415922014623

* The number of max_depth : 9
* The number of actual depth: 9
Test accuracy: 0.8156007172743575
Train accuracy 0.9214053614947197

* The number of max_depth : 10
* The number of actual depth: 10
Test accuracy: 0.8257621040047818
Train accuracy 0.930798131600325

* The number of max_depth : 11
* The number of actual depth: 11
Test accuracy: 0.8347280334728033
Train accuracy 0.9381600324939074
plt.figure(figsize=(10, 5))
plt.plot(range(1, 12), train_acc, marker="o", label="train accuracy")
plt.plot(range(1, 12), test_acc, marker="o", label="test accuracy")
plt.xlabel("Max Depth")
plt.ylabel("Accuracy")
plt.legend()
plt.show()
A depth of 3 is again the most appropriate choice for the demonstration model.
Let's save the model with pickle.
import pickle
# Save the model to a file
with open('../models/dt_model23.pkl', 'wb') as file:
pickle.dump(dt_model23, file)
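For completeness, here is a minimal sketch of the matching load step plus a sanity prediction. It uses a toy tree and an in-memory round trip (`pickle.dumps`/`pickle.loads`) so it runs without the saved file; with the real file you would use `pickle.load(open('../models/dt_model23.pkl', 'rb'))` instead:

```python
import pickle
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for dt_model23, trained on made-up data.
toy_X = [[1, 1], [2, 2], [4, 4], [5, 5]]
toy_y = [0, 0, 1, 1]
model = DecisionTreeClassifier(random_state=42).fit(toy_X, toy_y)

blob = pickle.dumps(model)     # equivalent of pickle.dump(..., file)
restored = pickle.loads(blob)  # equivalent of pickle.load(file)

# The restored model predicts exactly like the original.
print(restored.predict([[1.5, 1.5], [4.5, 4.5]]))  # -> [0 1]
```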
Model and variable names in the KNN section will be numbered starting from 30.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
For this model, both the model name and its variable names use the number 31.
X31 = df_clean.drop(['SCORE', 'RESULTS_OF_SCORE', 'RESULTS_CODE'], axis=1)
y31 = df_clean['RESULTS_CODE']
# we have object columns like 'PROPERTY_TYPE', 'WARDNAME' , etc.
X31_encoded = pd.get_dummies(X31, columns=['PROPERTY_TYPE', 'WARDNAME', 'GRID'])
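A tiny illustration of what `get_dummies` does to one categorical column (the values below are made up, not taken from the dataset):

```python
import pandas as pd

# Hypothetical PROPERTY_TYPE values for illustration only.
toy = pd.DataFrame({'PROPERTY_TYPE': ['PRIVATE', 'SOCIAL HOUSING', 'PRIVATE']})
encoded = pd.get_dummies(toy, columns=['PROPERTY_TYPE'])

# One indicator column per category replaces the original text column.
print(encoded.columns.tolist())
# -> ['PROPERTY_TYPE_PRIVATE', 'PROPERTY_TYPE_SOCIAL HOUSING']
```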
KNN is sensitive to differences in feature scale, so we need to standardize the data before fitting KNN classifiers.
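To see why, here is a small sketch with two made-up features on very different scales. After `StandardScaler`, each column has mean 0 and unit variance, so neither feature dominates the Euclidean distances KNN computes:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up features: storeys (small numbers) vs. year built (large numbers).
toy = np.array([[3.0, 1960.0],
                [10.0, 1975.0],
                [25.0, 2005.0]])
scaled = StandardScaler().fit_transform(toy)

print(np.abs(scaled.mean(axis=0)).round(6))  # each column mean ~ 0
print(scaled.std(axis=0).round(6))           # each column std  ~ 1
```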
y31
1 2
2 2
3 2
4 2
5 2
..
11755 2
11756 2
11757 2
11758 2
11759 2
Name: RESULTS_CODE, Length: 11152, dtype: int64
from sklearn.preprocessing import StandardScaler
X31 = StandardScaler().fit_transform(X31_encoded)
# Split the data
X_train31, X_test31, y_train31, y_test31 = train_test_split(X31, y31, test_size=0.3, random_state=1)  # split the scaled features
from imblearn.over_sampling import SMOTE
# SMOTE
smote = SMOTE(random_state=42)
print('The numbers of Building Audit(0) before:', X_train31[y_train31 == 0].shape[0])
print('The numbers of Building Audit(1) before:', X_train31[y_train31 == 1].shape[0])
print('The numbers of Building Audit(2) before:', X_train31[y_train31 == 2].shape[0])
print('The numbers of Building Audit(3) before:', X_train31[y_train31 == 3].shape[0])
# resampling
X_train_resampled31, y_train_resampled31 = smote.fit_resample(X_train31, y_train31)
print()
print('The numbers of Building Audit(0) after:', X_train_resampled31[y_train_resampled31 == 0].shape[0])
print('The numbers of Building Audit(1) after:', X_train_resampled31[y_train_resampled31 == 1].shape[0])
print('The numbers of Building Audit(2) after:', X_train_resampled31[y_train_resampled31 == 2].shape[0])
print('The numbers of Building Audit(3) after:', X_train_resampled31[y_train_resampled31 == 3].shape[0])
The numbers of Building Audit(0) before: 71
The numbers of Building Audit(1) before: 1746
The numbers of Building Audit(2) before: 4949
The numbers of Building Audit(3) before: 1040

The numbers of Building Audit(0) after: 4949
The numbers of Building Audit(1) after: 4949
The numbers of Building Audit(2) after: 4949
The numbers of Building Audit(3) after: 4949
# Instantiate the model & fit it to our data
KNN_model31 = KNeighborsClassifier(n_neighbors=3)
KNN_model31.fit(X_train31, y_train31)
KNeighborsClassifier(n_neighbors=3)
# Score the model on the test set
train_predictions31 = KNN_model31.predict(X_train31)
test_predictions31 = KNN_model31.predict(X_test31)
train_accuracy31 = accuracy_score(train_predictions31, y_train31)
test_accuracy31 = accuracy_score(test_predictions31, y_test31)
print(f"Train set accuracy: {train_accuracy31}")
print(f"Test set accuracy: {test_accuracy31}")
Train set accuracy: 0.7950294645144761
Test set accuracy: 0.6413628212791392
With n_neighbors set to 3, the model appears to overfit: the train set accuracy is much higher than the test set accuracy.
So let's plot accuracy against the number of neighbors to try other values.
n_neighbor = range(1, 60, 10)
train_accuracy = []
test_accuracy = []
for n in n_neighbor:
# Instantiate the model & fit it to our data
KNN_model = KNeighborsClassifier(n_neighbors=n)
KNN_model.fit(X_train31, y_train31)
    # Score the model and append the results to the lists
train_accuracy.append(KNN_model.score(X_train31, y_train31))
test_accuracy.append(KNN_model.score(X_test31, y_test31))
plt.figure(figsize=(10,7))
plt.plot(n_neighbor, train_accuracy, color='b', label = 'train')
plt.plot(n_neighbor, test_accuracy, color='r', label = 'test')
plt.xlabel('K for KNN')
plt.ylabel('accuracy')
plt.legend()
plt.title('Accuracy of KNN vs. number of neighbors')
plt.show()
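The best k can also be read off programmatically rather than by eye. A sketch with illustrative accuracy values (not the actual numbers from the run above):

```python
ks = list(range(1, 60, 10))
test_acc = [0.58, 0.64, 0.66, 0.67, 0.685, 0.68]  # illustrative values only

# Pick the k with the highest held-out accuracy.
best_k = ks[test_acc.index(max(test_acc))]
print(best_k)  # -> 41
```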
From the plot, a k in the low forties looks appropriate. Let's try 43.
This model is numbered 32 and reuses the variables numbered 31.
# Instantiate the model & fit it to our data with parameter 43
KNN_model32 = KNeighborsClassifier(n_neighbors=43)
KNN_model32.fit(X_train31, y_train31)
# Score the model on the test set
train_predictions32 = KNN_model32.predict(X_train31)
test_predictions32 = KNN_model32.predict(X_test31)
train_accuracy32 = accuracy_score(train_predictions32, y_train31)
test_accuracy32 = accuracy_score(test_predictions32, y_test31)
print(f"Train set accuracy: {train_accuracy32}")
print(f"Test set accuracy: {test_accuracy32}")
Train set accuracy: 0.683192416090187
Test set accuracy: 0.682904961147639
On review, k = 43 works best for our model. With this setting, the model predicts with about 68% accuracy.
from sklearn.model_selection import cross_val_score
# Instantiate the KNN model with the desired number of neighbors
KNN_model33 = KNeighborsClassifier(n_neighbors=43)
# Perform 5-Fold Cross Validation
cross_val_scores33 = cross_val_score(KNN_model33, X_train31, y_train31, cv=5)
# Print the cross-validation scores
print("Cross-Validation Scores:", cross_val_scores33)
# Calculate and print the mean and standard deviation of the cross-validation scores
mean_cv_score33 = cross_val_scores33.mean()
std_cv_score33 = cross_val_scores33.std()
print(f"Mean Cross-Validation Score: {mean_cv_score33}")
print(f"Standard Deviation of Cross-Validation Scores: {std_cv_score33}")
Cross-Validation Scores: [0.66965429 0.66816143 0.67905189 0.66752082 0.66880205]
Mean Cross-Validation Score: 0.6706380968239112
Standard Deviation of Cross-Validation Scores: 0.0042657268177913295
We attempted cross-validation to improve accuracy, but it did not help. We suspect this is because the model is either not well suited to our data or too simple.
In the data, one sample scored 3 points on every evaluation item yet received an overall score of 100. This suggests the score is not a simple calculation over the evaluation items but also reflects factors such as registration year, building year, and location. In the regression model, however, the machine learning appears to have derived its own formula: the accuracy is so high that it is hard to regard the output as a genuine prediction.
Therefore, a classification model is the better fit for this prediction task. Among the classifiers we tried, the decision tree is the most accurate, so it is the right choice for us.